ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering

Amiangshu Bosu; Anindya Iqbal; Jaydeb Sarker; Md Awsaf Alam Anindya; Showvik Biswas

arxiv: 2604.14408 · v1 · submitted 2026-04-15 · 💻 cs.SE

Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, Amiangshu Bosu

Pith reviewed 2026-05-10 12:38 UTC · model grok-4.3

keywords: toxicity detection, code reviews, large language models, browser extension, software engineering, constructive communication, GitHub pull requests

The pith

A browser extension detects toxic code review comments and generates constructive rewrites in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToxiShield, a GitHub-integrated tool that identifies harmful language in pull request discussions, explains the issues, and suggests revised versions that preserve meaning while sounding more supportive. It builds this from separate models for detection, categorization, and reframing, then checks the results with both performance numbers and a small user study. A reader would care because negative exchanges in code reviews can discourage contributors and slow collaboration in open source projects. If the approach holds, everyday development environments could start actively guiding people toward clearer, less abrasive feedback.

Core claim

ToxiShield provides a three-part system for GitHub pull requests: a BERT model that flags toxic text, an LLM that assigns fine-grained toxicity categories with explanations, and a fine-tuned Llama model that rewrites toxic comments into constructive alternatives. The authors select the best-performing versions of each component after training on thousands of real code review samples and confirm the overall tool through automated accuracy metrics plus a Technology Acceptance Model survey with ten participants.
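The three-part flow can be pictured with a minimal sketch. Everything here is hypothetical: the function names, the `Suggestion` type, and the placeholder stages stand in for the paper's BERT filter, LLM coach, and fine-tuned reframer. Gating the two LLM stages on the filter's verdict is also an assumption about the control flow, not something the summary confirms.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    """What the commenter would see for one draft comment."""
    toxic: bool
    category: Optional[str] = None      # fine-grained label from the coach
    explanation: Optional[str] = None   # rationale from the coach
    rewrite: Optional[str] = None       # constructive alternative


def review_comment(
    text: str,
    is_toxic: Callable[[str], bool],
    categorize: Callable[[str], tuple[str, str]],
    reframe: Callable[[str], str],
) -> Suggestion:
    """Run the cheap filter first; call the later stages only on flagged text."""
    if not is_toxic(text):
        return Suggestion(toxic=False)
    category, explanation = categorize(text)
    return Suggestion(True, category, explanation, reframe(text))


# Placeholder stages (hypothetical): a keyword check and canned responses.
def demo_filter(text: str) -> bool:
    return "stupid" in text.lower()


def demo_coach(text: str) -> tuple[str, str]:
    return ("insult", "The comment attacks the author rather than the code.")


def demo_reframer(text: str) -> str:
    return "This approach has a bug; could you take another look?"


clean = review_comment("Looks good, thanks!", demo_filter, demo_coach, demo_reframer)
flagged = review_comment("This is a stupid fix.", demo_filter, demo_coach, demo_reframer)
print(clean.toxic, flagged.category, flagged.rewrite)
```

Passing the stages as callables keeps the sketch independent of any particular model; a real deployment would replace the three `demo_*` functions with model calls.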

What carries the argument

The Reframer module: a model fine-tuned on labeled examples that takes toxic code review text and produces a revised version, keeping the technical content while shifting to a non-toxic, fluent style.

If this is right

Developers receive immediate, in-context suggestions that let them revise comments before posting, lowering the chance of escalating conflicts.
The categorization step supplies concrete reasons for flagged text, giving users repeated practice in recognizing and avoiding toxic patterns.
Wider use in open source repositories could increase retention of contributors who might otherwise leave after negative review experiences.
The browser-extension format shows a practical path for adding similar support features directly into existing collaboration platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detection and rewriting pipeline could be adapted to other developer channels such as commit messages or issue threads without major redesign.
Over time, repeated exposure to reframed examples might gradually change default communication habits across entire teams or projects.
Pairing the reframer with static analysis tools could add a check that rewritten comments still accurately describe the technical point under discussion.

Load-bearing premise

The models will keep identifying toxicity correctly and producing useful rewrites when they encounter new comments from projects and developers outside the original training data.

What would settle it

Apply the full ToxiShield pipeline to several hundred recent GitHub pull request comments never seen during training and measure whether human reviewers agree with its toxicity flags and rephrasings at rates substantially above random chance.
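One way to make "agreement above chance" concrete is Cohen's kappa, which subtracts the agreement two raters would reach by chance from the agreement observed. The sketch below is illustrative only: the label lists are made up, and treating the tool as one rater and a human reviewer as the other is an assumption about how such a study might be scored.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical flags: 1 = toxic, 0 = not toxic.
tool_flags = [1, 1, 0, 0, 1, 0, 1, 0]
human_flags = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(tool_flags, human_flags)
print(round(kappa, 3))
```

Here the raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5. Raw agreement alone would overstate the result when one label dominates.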

Figures

Figures reproduced from arXiv: 2604.14408 by Amiangshu Bosu, Anindya Iqbal, Jaydeb Sarker, Md Awsaf Alam Anindya, Showvik Biswas.

**Figure 1.** Motivational Workflow of ToxiShield. Excerpt from the adjacent text: "… research indicates that communication within OSS communities frequently becomes unprofessional and toxic [10, 21, 28, 39, 43–45]. This hostility produces severe consequences. It creates significant barriers for new participants [39] and causes active contributors to leave the project [43]. Furthermore, toxic interactions disproportionately affect minority and junior dev…"

**Figure 2.** Workflow of the Toxicity Filter Module, where 15 M PR comments are selected from [44]. Excerpt from the adjacent text: "… often prioritize offline classification accuracy over the latency constraints required for seamless integration into real-time developer workflows. For instance, the current state-of-the-art tool, ToxiCR [45], achieves strong performance metrics (88.9% F1-score, 95.8% accuracy); however, recent studies indicate that its e…"

**Figure 3.** Communication Coach: Iterative Prompt Evolution

**Figure 4.** Confusion Matrix for Communication Coach, where SD represents Self Deprecation, Identity Attack

**Figure 5.** The Reframer: A teacher model is used to create a parallel dataset, which is used to train a student detoxification model with iterative prompting. Excerpt from the adjacent text: "… model reliably detects general toxicity and generates logical rationales, it occasionally struggles to resolve the blurred boundaries between closely related interpersonal toxicities. Key Finding 2: Claude 3.5 Sonnet achieves best performance on multiclass toxi…"

**Figure 6.** Browser Extension to integrate ToxiShield with GitHub-based pull request reviews

Original abstract

Toxic interactions during code reviews can undermine teamwork and hinder productivity in software engineering (SE) teams. While prior studies explore toxicity detection and empirical investigation, they lack real-time detoxification tools to support the SE community. To address this gap, we present ToxiShield, a browser extension for GitHub pull requests that is built using three modules: i) Toxicity Filter -- to identify whether a text is toxic, ii) Communication coach -- to facilitate just-in-time fine-grained toxicity categorization with explanations, and iii) The Reframer -- that generates a revised, constructive alternative of a toxic text. For each module, we trained and evaluated multiple deep learning and Large Language Models (LLMs) to identify the best choice. A BERT-based binary detection model, trained on 38,761 code review samples, achieves 98% accuracy and an F1-score of 97% and is the selected one for the Toxicity Filter module. For the Communication Coach, prompt-tuned Claude 3.5 Sonnet achieved the best performance with 39% MCC and 42% F1 in multiclass toxicity classification with detailed reasoning. For Reframer, we evaluated five LLMs using a fine-tuning strategy on a dataset of 10,120 code review comments. The fine-tuned Llama 3.2 model achieves 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. We further validated ToxiShield through a human evaluation using the Technology Acceptance Model with 10 participants, confirming its perceived usefulness and ease of adoption. ToxiShield sets a benchmark for advancing constructive communication in software engineering, driving inclusivity and healthier collaboration in open-source communities.
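The abstract reports style transfer accuracy, fluency, content preservation, and a J-score, but does not define how the J-score combines them. A minimal sketch of one formulation used in the text style transfer literature follows: the mean over samples of the per-sample product of the three scores. The names `sta`, `sim`, and `fl` and all the numbers are assumptions for illustration, and the paper may aggregate differently.

```python
def j_score(sta, sim, fl):
    """Mean over samples of STA_i * SIM_i * FL_i, each score in [0, 1]."""
    if not (len(sta) == len(sim) == len(fl)) or not sta:
        raise ValueError("need three non-empty score lists of equal length")
    return sum(a * s * f for a, s, f in zip(sta, sim, fl)) / len(sta)


# Hypothetical per-sample scores for four rewritten comments.
sta = [1.0, 1.0, 0.0, 1.0]   # style transfer: is the rewrite non-toxic?
sim = [0.8, 0.6, 0.9, 0.5]   # content preservation vs. the original
fl = [1.0, 0.9, 1.0, 1.0]    # fluency of the rewrite

j = j_score(sta, sim, fl)
mean_sim = sum(sim) / len(sim)
print(round(j, 3), round(mean_sim, 3))
```

Under this formulation each per-sample product is at most that sample's content-preservation score, so J can never exceed the mean content preservation. With the reported 67.07% content preservation, a J-score of 84% would therefore need a different aggregation than the one sketched here; for comparison, the plain product of the three reported averages is about 0.62.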

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds ToxiShield, a usable GitHub extension for toxicity detection and reframing in code reviews, but all of its performance numbers come from internal data splits, with no live out-of-distribution testing.

The main takeaway is that this paper delivers a working browser extension that combines toxicity filtering, category explanations, and constructive rewriting for GitHub pull requests, trained and tested on software engineering data. It is an applied integration rather than a new algorithm, but the end-to-end system with model selection and a small user study is concrete enough to be worth looking at.

They picked a BERT detector, trained on 38,761 code review samples, that reaches 98% accuracy and 97% F1; used prompt-tuned Claude 3.5 Sonnet for multiclass categorization; and fine-tuned Llama 3.2 for reframing on 10,120 samples, reaching 95.27% style transfer accuracy and an 84% J-score. The Technology Acceptance Model study with 10 participants showed positive perceived usefulness. That combination of a live tool plus SE-specific numbers is what is actually new here. Prior toxicity work in code reviews exists, but packaging it as an extension with these three modules and reporting the specific model choices is the practical step forward.

The paper does a reasonable job of comparing several models per module and giving clear metrics for each. The detection numbers look strong on their data, and the reframing shifts tone reliably, although content preservation (67.07%) is the weakest of its metrics.

The soft spots are mostly around evaluation. All reported results use internal train-test splits from the collected corpora; there is no separate test set drawn from current, heterogeneous GitHub threads across projects, languages, or time periods. Toxicity and good rewrites are context-sensitive, so performance on curated data does not automatically carry over to the deployment setting the extension targets. The Communication Coach module only reaches 39% MCC, which is modest for multiclass work. The human study is also small.

This is for tool builders or SE researchers who want to improve inclusivity in open-source reviews or who need baseline numbers on toxicity models for code comments. A reader looking for a ready-to-adapt system would find the model selections and dataset sizes useful. I would send it to peer review. The implementation is real and the idea addresses a practical pain point, even though stronger generalization tests and a larger user study would strengthen it.
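To give the 39% MCC figure some context, here is a minimal sketch of the multiclass Matthews correlation coefficient computed from label counts. The category names and predictions are invented for illustration (the paper's actual label set is not listed on this page beyond Self Deprecation and Identity Attack), and the paper's own evaluation code may differ.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC: (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum p_k^2)(s^2 - sum t_k^2))."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    numerator = c * s - sum(t_k[k] * p_k[k] for k in set(t_k) | set(p_k))
    denominator = sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical labels for six flagged comments.
y_true = ["identity_attack", "identity_attack", "self_deprecation",
          "self_deprecation", "other", "other"]
y_pred = ["identity_attack", "self_deprecation", "self_deprecation",
          "self_deprecation", "other", "identity_attack"]

mcc = multiclass_mcc(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(mcc, 3), round(accuracy, 3))
```

In this toy case four of six labels are correct, yet MCC comes out lower than accuracy because it discounts agreement that the class frequencies alone would produce. That is why an MCC near 0.4 signals only moderate category-level reliability.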

Referee Report

2 major / 2 minor

Summary. The paper introduces ToxiShield, a browser extension for GitHub pull requests comprising three modules: a BERT-based Toxicity Filter (98% accuracy, 97% F1 on 38,761 code review samples), a prompt-tuned Claude 3.5 Sonnet Communication Coach (39% MCC, 42% F1 for multiclass categorization), and a fine-tuned Llama 3.2 Reframer (95.27% style transfer accuracy, 84% J-score on 10,120 samples). It also reports a Technology Acceptance Model human evaluation with 10 participants confirming perceived usefulness and ease of use. The work positions the tool as a benchmark for real-time detoxification to promote inclusive developer communication.

Significance. If the reported performance generalizes, ToxiShield offers a practical, integrated system for toxicity detection, explanation, and constructive reframing directly in code review workflows, addressing a gap in SE tooling. The concrete dataset sizes, multi-model evaluation, and small-scale TAM validation provide a starting point for deployment-oriented research. The combination of filter + coach + reframer is a strength, but the absence of out-of-distribution testing on live GitHub data limits claims about real-world effectiveness.

major comments (2)

[Abstract] Abstract and Evaluation sections: All performance figures (BERT 98% accuracy/97% F1, Claude 39% MCC, Llama 95.27% style-transfer/84% J-score) are obtained from internal train/test splits of the curated corpora. No baseline comparisons (e.g., against prior toxicity detectors or simpler classifiers), statistical significance tests, confidence intervals, or details on annotation quality and data splits are provided, making it impossible to judge whether the selected models are meaningfully superior or reliable.
[Evaluation] Deployment and Evaluation sections: The central claim that ToxiShield enables effective real-time filtering, categorization, and reframing inside actual GitHub pull requests is not supported by any held-out evaluation on heterogeneous, temporally current GitHub threads (different projects, languages, reviewer styles, or temporal drift). Toxicity labels and constructive rewrites are context-sensitive; performance on the paper's internal distributions does not establish behavior on the deployment distribution targeted by the browser extension.
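The held-out evaluation the second comment asks for could start from a split that is disjoint in both project and time. The sketch below is a hypothetical illustration: the `Comment` fields, project names, and cutoff date are assumptions, not details from the paper's datasets.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Comment:
    project: str
    posted: date
    text: str


def out_of_distribution_split(comments, held_out_projects, cutoff):
    """Train on older comments from seen projects; test on newer ones from unseen projects."""
    train = [c for c in comments
             if c.project not in held_out_projects and c.posted < cutoff]
    test = [c for c in comments
            if c.project in held_out_projects and c.posted >= cutoff]
    # Comments matching neither rule are dropped so the two sets share
    # no project and no time period.
    return train, test


# Hypothetical corpus of four review comments from two projects.
comments = [
    Comment("alpha", date(2024, 1, 5), "a-old"),
    Comment("alpha", date(2025, 6, 1), "a-new"),
    Comment("beta", date(2024, 3, 2), "b-old"),
    Comment("beta", date(2025, 7, 9), "b-new"),
]
train, test = out_of_distribution_split(comments, {"beta"}, date(2025, 1, 1))
print([c.text for c in train], [c.text for c in test])
```

Dropping the comments that fall in neither bucket costs data, but it prevents a model from seeing either the test projects or the test period during training.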

minor comments (2)

[Abstract] Abstract: The phrase 'sets a benchmark' is not justified by the reported experiments; consider rephrasing to 'provides an initial integrated prototype' or similar.
[Human Evaluation] The human evaluation description (10 participants, TAM) lacks details on participant recruitment, task instructions, and statistical analysis of the survey responses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects of evaluation rigor and real-world applicability. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Referee: [Abstract] Abstract and Evaluation sections: All performance figures (BERT 98% accuracy/97% F1, Claude 39% MCC, Llama 95.27% style-transfer/84% J-score) are obtained from internal train/test splits of the curated corpora. No baseline comparisons (e.g., against prior toxicity detectors or simpler classifiers), statistical significance tests, confidence intervals, or details on annotation quality and data splits are provided, making it impossible to judge whether the selected models are meaningfully superior or reliable.

Authors: We agree that the current evaluation would be strengthened by explicit baseline comparisons, statistical significance testing, confidence intervals, and fuller details on annotation processes and data splits. In the revised manuscript we will add (1) comparisons against simpler classifiers (e.g., logistic regression on TF-IDF) and representative prior toxicity detectors, (2) McNemar or paired t-tests with p-values, (3) 95% confidence intervals via bootstrapping, and (4) expanded description of the annotation protocol, inter-annotator agreement, and train/validation/test split ratios. These additions will allow readers to better assess model reliability and superiority. revision: yes
Referee: [Evaluation] Deployment and Evaluation sections: The central claim that ToxiShield enables effective real-time filtering, categorization, and reframing inside actual GitHub pull requests is not supported by any held-out evaluation on heterogeneous, temporally current GitHub threads (different projects, languages, reviewer styles, or temporal drift). Toxicity labels and constructive rewrites are context-sensitive; performance on the paper's internal distributions does not establish behavior on the deployment distribution targeted by the browser extension.

Authors: We acknowledge that the reported metrics derive from curated, internal train/test splits rather than live, temporally current GitHub threads spanning diverse projects and reviewer styles. The datasets themselves consist of real code-review comments, and the 10-participant TAM study provides evidence of perceived usefulness in a realistic usage scenario. In revision we will (1) add an explicit limitations subsection discussing the gap between curated and live distributions, (2) report any available qualitative observations from the TAM sessions that touched on context sensitivity, and (3) outline concrete plans for future deployment studies. We cannot, however, retroactively supply new held-out live-GitHub results within the current revision cycle. revision: partial
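Two of the analyses promised in the first response, bootstrapped 95% confidence intervals and McNemar's test, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' procedure: the labels are invented, the percentile bootstrap is one of several interval methods, and the exact binomial form of McNemar's test is used here.

```python
import random
from math import comb


def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample test items with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    low = scores[round(alpha / 2 * n_resamples)]
    high = scores[round((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b = items only model A got right, c = items only model B got right.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical test labels and predictions (1 = toxic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

low, high = bootstrap_ci(y_true, y_pred, f1_score)
p_value = mcnemar_exact(1, 9)
print(round(f1_score(y_true, y_pred), 3), (round(low, 3), round(high, 3)), round(p_value, 4))
```

McNemar's test looks only at the items where two models disagree, which suits comparing a selected model against a baseline on the same test set. The bootstrap interval conveys how much a single reported F1 could move with a different draw of test items.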

Circularity Check

0 steps flagged

No circularity: standard empirical ML training and evaluation

The paper presents an applied system built from three independently trained models (BERT binary classifier on 38,761 samples, prompt-tuned Claude for multiclass, fine-tuned Llama 3.2 on 10,120 samples) whose performance numbers are obtained via conventional train/test splits and a 10-participant TAM study. No equations, derivations, uniqueness theorems, or self-citations are invoked to justify the central claims; each reported metric follows directly from the training procedure and external human ratings without reducing to a definitional loop or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard supervised learning assumptions and the premise that toxicity can be reliably labeled and reframed without loss of technical meaning; no new physical or mathematical entities are introduced.

Reference graph

Works this paper leans on

[1] Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026. ToxiShield: Replication Package. https://github.com/WSU-SEAL/ToxiShield
[2] Sarthak Bharadwaj, Fabio Santos, and Bianca Trinkenreich. 2025. The Shifting Sands of Toxicity: The Evolving Nature of Interpersonal Challenges in Open Source. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–11. doi:10.1109/ESEM64174.2025.00016
[3] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101
[4] Kevin Daniel André Carillo, Josianne Marsan, and Bogdan Negoita. 2016. Towards developing a theory of toxicity in the context of free/open source software & peer production communities. SIGOPEN 2016 (2016)
[5] Ofer Dekel and Ohad Shamir. 2010. Multiclass-multilabel classification with more classes than examples. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 137–144
[6] Tanni Dev, Sayma Sultana, and Amiangshu Bosu. 2025. Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Honolulu, Hawaii, USA). 1–12. doi:10.1109/ESEM64174.2025.00059
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. doi:10.18653...
[8] Carolyn D Egelman, Emerson Murphy-Hill, Elizabeth Kammer, Margaret Morrow Hodges, Collin Green, Ciera Jaspan, and James Lin. 2020. Predicting developers’ negative feelings about code review. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 174–185. doi:10.1145/3377811.3380414
[9] Ramtin Ehsani, Mia Mohammad Imran, Robert Zita, Kostadin Damevski, and Preetha Chatterjee. 2024. Incivility in open source projects: A comprehensive annotated dataset of locked github issue threads. In Proceedings of the 21st International Conference on Mining Software Repositories. 515–519. doi:10.1145/3643991.3644887
[10] Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee. 2023. Exploring moral principles exhibited in oss: A case study on github heated issues. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2092–2096
[11] Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee. 2025. Analyzing Toxicity in Open Source Software Communications Using Psycholinguistics and Moral Foundations Theory. (2025), 1–8. doi:10.1109/NLBSE66842.2025.00006
[12] Laura Faulkner. 2003. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments, & Computers 35, 3 (2003), 379–383. doi:10.3758/BF03195514
[13] Isabella Ferreira, Jinghui Cheng, and Bram Adams. 2021. The "shut the f**k up" phenomenon: Characterizing incivility in open source code review discussions. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–35. doi:10.1145/3479497
[14] Isabella Ferreira, Ahlaam Rafiq, and Jinghui Cheng. 2024. Incivility detection in open source code review and issue discussions. Journal of Systems and Software 209 (2024), 111935. doi:10.1016/j.jss.2023.111935
[15] Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32. doi:10.1609/aaai.v32i1.11330
[16] Shai Gretz, Alon Halfon, Ilya Shnayderman, Orith Toledo-Ronen, Artem Spector, Lena Dankin, Yannis Katsis, Ofir Arviv, Yoav Katz, Noam Slonim, and Liat Ein-Dor. 2023. Zero-shot Topical Text Classification with LLMs - an Experimental Study. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 96...
[17] Sanuri Dananja Gunawardena, Peter Devine, Isabelle Beaumont, Lola Piper Garden, Emerson Murphy-Hill, and Kelly Blincoe. 2022. Destructive criticism in software code review impacts inclusion. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–29. doi:10.1145/3555183
[18] Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, and Hongxin Hu. 2023. An Investigation of Large Language Models for Real-World Hate Speech Detection. In 2023 International Conference on Machine Learning and Applications (ICMLA). 1568–1573. doi:10.1109/ICMLA58977.2023.00237
[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
[20] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
[21] Mia Mohammad Imran, Robert Zita, Rahat Rizvi Rahman, Preetha Chatterjee, and Kostadin Damevski. 2026. Toxicity Ahead: Forecasting Conversational Derailment on GitHub. In Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26) (Rio de Janeiro, Brazil). ACM, Rio de Janeiro, Brazil. doi:10.1145/3744916.3787839
[22] Satyanarayana Chowdary Kadiyala, Jaydeb Sarker, and Bianca Trinkenreich. 2026. How should self-deprecation comments be classified? A toxicity analysis study on Zephyr. In Proceedings of the 5th ACM/IEEE International Workshop on NL-based Software Engineering (NLBSE). doi:10.1145/3786164.3788447
[23] Anjali Khurana, Hariharan Subramonyam, and Parmit K Chilana. 2024. Why and when llm-based assistants can go wrong: Investigating the effectiveness of prompt-based interactions for software help-seeking. In Proceedings of the 29th International Conference on Intelligent User Interfaces. 288–303. doi:10.1145/3640543.3645200
[24] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213
[25] Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing Toxic Content Classification for a Diversity of Perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021). USENIX Association, 299–318. https://www.usenix.org/conference/soups2021/prese...
[26] J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (March 1977), 159. doi:10.2307/2529310
[27] Qingxiong Ma and Liping Liu. 2005. The Technology Acceptance Model. doi:10.4018/9781591404743.ch006.ch000
[28] Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, and Christian Kästner. 2022. "Did you miss my comment or what?" understanding toxicity in open source discussions. In Proceedings of the 44th International Conference on Software Engineering. 710–722. doi:10.1145/3510003.3510111
[29] Shyamal Mishra and Preetha Chatterjee. 2024. Exploring ChatGPT for Toxicity Detection in GitHub (ICSE-NIER ’24). Association for Computing Machinery, New York, NY, USA, 6–10. doi:10.1145/3639476.3639777
[30] Subhabrata Mukherjee, Ahmed Hassan Awadallah, and Jianfeng Gao. 2021. XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation. doi:10.48550/ARXIV.2106.04563
[31] Emerson Murphy-Hill, Jill Dicker, Delphine Carlson, Marian Harbach, Ambar Murillo, and Tao Zhou. 2024. Did Gerrit’s Respectful Code Review Reminders Reduce Comment Toxicity? In Equity, Diversity, and Inclusion in Software Engineering: Best Practices and Insights. Apress Berkeley, CA, 309–321. doi:10.1007/978-1-4842-9651-6_18
[32] Emerson Murphy-Hill, Ciera Jaspan, Carolyn Egelman, and Lan Cheng. 2022. The pushback effects of race, ethnicity, gender, and age in code review. Commun. ACM 65, 3 (2022), 52–57. doi:10.1145/3474097
[33] Nicole Novielli, Fabio Calefato, Filippo Lanubile, and Alexander Serebrenik. 2021. Assessment of off-the-shelf SE-specific sentiment analysis tools: An extended replication study. Empirical Software Engineering 26, 4 (2021), 77. doi:10.1007/s10664-021-09960-w
[34] Phil Sidney Ostheimer, Mayank Kumar Nagda, Marius Kloft, and Sophie Fellenz. 2024. Text Style Transfer Evaluation Using Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, 15802–15822. https://aclanthology.org/202...
[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318
[36] Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style Transfer Through Back-Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, Melbourne, Australia, 866–876. doi:...
[37] Huilian Sophie Qiu, Bogdan Vasilescu, Christian Kästner, Carolyn Egelman, Ciera Jaspan, and Emerson Murphy-Hill. 2022. Detecting interpersonal conflict in issues and code review: cross pollinating open- and closed-source approaches. In Proceedings of the 2022 ACM/IEEE 44th International Conference on Software Engineering: Software Engineering in Society (Pittsburgh, Pennsylvania) (ICSE-SEIS ’22). Association for Computing Machinery, New York, NY, USA, 41–55. doi:10.1145/3510...
[38] Md Shamimur Rahman, Zadia Codabux, and Chanchal K Roy. 2024. Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1632–1655. doi:10.1145/3660780
[39] Naveen Raman, Minxuan Cao, Yulia Tsvetkov, Christian Kästner, and Bogdan Vasilescu. 2020. Stress and burnout in open source: Toward finding, understanding, and mitigating unhealthy interactions. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results. 57–60. doi:10.1145/3377816.3381732
[40] Sarthak Roy, Ashish Harshvardhan, Animesh Mukherjee, and Punyajoy Saha. 2023. Probing LLMs for hate speech detection: strengths and vulnerabilities. In Findings of the Association for Computational Linguistics: EMNLP 2023. ...
[41] Elvis Saravia. 2022. Prompt Engineering Guide. https://github.com/dair-ai/Prompt-Engineering-Guide (12 2022)
[42] Jaydeb Sarker, Sayma Sultana, Steven R Wilson, and Amiangshu Bosu. 2023. ToxiSpanSE: An explainable toxicity detection in code review comments. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 1–12. doi:10.1109/ESEM56168.2023.10304855
[43] Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2020. A benchmark study of the contemporary toxicity detectors on software engineering interactions. In 2020 27th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 218–227. doi:10.1109/APSEC51365.2020.00030
[44] Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2025. The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub. Proceedings of the ACM on Software Engineering 2, FSE (2025), 623–646. doi:10.1145/3715744
[45] Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. 2023. Automated Identification of Toxic Code Reviews Using ToxiCR. ACM Transactions on Software Engineering and Methodology 32, 5 (July 2023), 1–32. doi:10.1145/3583562
[46] Sulayman K Sowe, Ioannis Stamelos, and Lefteris Angelis. 2008. Understanding knowledge sharing activities in free/open source software projects: An empirical study. Journal of Systems and Software 81, 3 (2008), 431–446. doi:10.1016/j.jss.2007.03.086
[47] Sayma Sultana, Jaydeb Sarker, Farzana Israt, Rajshakhar Paul, and Amiangshu Bosu. 2025. Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments. ACM Transactions on Software Engineering Methodology (TOSEM) 34 (2025). doi:10.1145/3757739
[48] Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. 2023. Text Classification via Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023. 8990–9005. doi:10.18653/v1/2023.findings-emnlp.603
[49] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
[50] Bianca Trinkenreich, Mariam Guizani, Igor Wiese, Tayana Conte, Marco Gerosa, Anita Sarma, and Igor Steinmacher. 2021. Pots of gold at the end of the rainbow: what is success for open source contributors? IEEE Transactions on Software Engineering 48, 10 (2021), 3940–3953. doi:10.1109/TSE.2021.3108032
[51] Robert A Virzi. 1992. Refining the test phase of usability evaluation: how many subjects is enough? Human factors 34, 4 (1992), 457–468
[52] Zhiqiang Wang, Yiran Pang, and Yanbin Lin. 2023. Large Language Models Are Zero-Shot Text Classifiers. doi:10.48550/ARXIV.2312.01044
[53] Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7 (09 2019), 625–641. doi:10.1162/tacl_a_00290
[54] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf

Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE123. Publication date: July 2026.

work page 2022
[59]

John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy,...

work page doi:10.18653/v1/p19-1427 2019
[60]

Frances Zlotnick. 2017. GitHub Open Source Survey 2017. http://opensourcesurvey.org/2017/. doi:10.5281/zenodo. 806811 Received 2025-09-10; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE123. Publication date: July 2026

work page doi:10.5281/zenodo 2017

[1]

Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026. ToxiShield: Replication Package. https://github.com/WSU-SEAL/ToxiShield

[2]

Sarthak Bharadwaj, Fabio Santos, and Bianca Trinkenreich. 2025. The Shifting Sands of Toxicity: The Evolving Nature of Interpersonal Challenges in Open Source. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–11. doi:10.1109/ESEM64174.2025.00016

[3]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101

[4]

Kevin Daniel André Carillo, Josianne Marsan, and Bogdan Negoita. 2016. Towards developing a theory of toxicity in the context of free/open source software & peer production communities. SIGOPEN 2016 (2016)

[5]

Ofer Dekel and Ohad Shamir. 2010. Multiclass-multilabel classification with more classes than examples. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 137–144

[6]

Tanni Dev, Sayma Sultana, and Amiangshu Bosu. 2025. Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models. In 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (Honolulu, Hawaii, USA). 1–12. doi:10.1109/ESEM64174.2025.00059

[7]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. doi:10.18653/v1/n19-1423

[8]

Carolyn D Egelman, Emerson Murphy-Hill, Elizabeth Kammer, Margaret Morrow Hodges, Collin Green, Ciera Jaspan, and James Lin. 2020. Predicting developers’ negative feelings about code review. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 174–185. doi:10.1145/3377811.3380414

[9]

Ramtin Ehsani, Mia Mohammad Imran, Robert Zita, Kostadin Damevski, and Preetha Chatterjee. 2024. Incivility in open source projects: A comprehensive annotated dataset of locked github issue threads. In Proceedings of the 21st International Conference on Mining Software Repositories. 515–519. doi:10.1145/3643991.3644887

[10]

Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee. 2023. Exploring moral principles exhibited in oss: A case study on github heated issues. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2092–2096

[11]

Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee. 2025. Analyzing Toxicity in Open Source Software Communications Using Psycholinguistics and Moral Foundations Theory. (2025), 1–8. doi:10.1109/NLBSE66842.2025.00006

[12]

Laura Faulkner. 2003. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments, & Computers 35, 3 (2003), 379–383. doi:10.3758/BF03195514

[13]

Isabella Ferreira, Jinghui Cheng, and Bram Adams. 2021. The "shut the f**k up" phenomenon: Characterizing incivility in open source code review discussions. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–35. doi:10.1145/3479497

[14]

Isabella Ferreira, Ahlaam Rafiq, and Jinghui Cheng. 2024. Incivility detection in open source code review and issue discussions. Journal of Systems and Software 209 (2024), 111935. doi:10.1016/j.jss.2023.111935

[15]

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32. doi:10.1609/aaai.v32i1.11330

[16]

Shai Gretz, Alon Halfon, Ilya Shnayderman, Orith Toledo-Ronen, Artem Spector, Lena Dankin, Yannis Katsis, Ofir Arviv, Yoav Katz, Noam Slonim, and Liat Ein-Dor. 2023. Zero-shot Topical Text Classification with LLMs - an Experimental Study. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 96... doi:10.18653/v1/2023.findings-emnlp.647

[17]

Sanuri Dananja Gunawardena, Peter Devine, Isabelle Beaumont, Lola Piper Garden, Emerson Murphy-Hill, and Kelly Blincoe. 2022. Destructive criticism in software code review impacts inclusion. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–29. doi:10.1145/3555183

[18]

Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, and Hongxin Hu. 2023. An Investigation of Large Language Models for Real-World Hate Speech Detection. In 2023 International Conference on Machine Learning and Applications (ICMLA). 1568–1573. doi:10.1109/ICMLA58977.2023.00237

[19]

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

[20]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

[22]

Mia Mohammad Imran, Robert Zita, Rahat Rizvi Rahman, Preetha Chatterjee, and Kostadin Damevski. 2026. Toxicity Ahead: Forecasting Conversational Derailment on GitHub. In Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26) (Rio de Janeiro, Brazil). ACM, Rio de Janeiro, Brazil. doi:10.1145/3744916.3787839

[23]

Satyanarayana Chowdary Kadiyala, Jaydeb Sarker, and Bianca Trinkenreich. 2026. How should self-deprecation comments be classified? A toxicity analysis study on Zephyr. In Proceedings of the 5th ACM/IEEE International Workshop on NL-based Software Engineering (NLBSE). doi:10.1145/3786164.3788447

[24]

Anjali Khurana, Hariharan Subramonyam, and Parmit K Chilana. 2024. Why and when llm-based assistants can go wrong: Investigating the effectiveness of prompt-based interactions for software help-seeking. In Proceedings of the 29th International Conference on Intelligent User Interfaces. 288–303. doi:10.1145/3640543.3645200

[25]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213

[26]

Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing Toxic Content Classification for a Diversity of Perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021). USENIX Association, 299–318. https://www.usenix.org/conference/soups2021/prese...

[27]

J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (March 1977), 159. doi:10.2307/2529310

[28]

Qingxiong Ma and Liping Liu. 2005. The Technology Acceptance Model. doi:10.4018/9781591404743.ch006.ch000

[29]

Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, and Christian Kästner. 2022. "Did you miss my comment or what?" understanding toxicity in open source discussions. In Proceedings of the 44th International Conference on Software Engineering. 710–722. doi:10.1145/3510003.3510111

[30]

Shyamal Mishra and Preetha Chatterjee. 2024. Exploring ChatGPT for Toxicity Detection in GitHub (ICSE-NIER’24). Association for Computing Machinery, New York, NY, USA, 6–10. doi:10.1145/3639476.3639777

[31]

Subhabrata Mukherjee, Ahmed Hassan Awadallah, and Jianfeng Gao. 2021. XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation. doi:10.48550/ARXIV.2106.04563

[32]

Emerson Murphy-Hill, Jill Dicker, Delphine Carlson, Marian Harbach, Ambar Murillo, and Tao Zhou. 2024. Did Gerrit’s Respectful Code Review Reminders Reduce Comment Toxicity? In Equity, Diversity, and Inclusion in Software Engineering: Best Practices and Insights. Apress, Berkeley, CA, 309–321. doi:10.1007/978-1-4842-9651-6_18

[33]

Emerson Murphy-Hill, Ciera Jaspan, Carolyn Egelman, and Lan Cheng. 2022. The pushback effects of race, ethnicity, gender, and age in code review. Commun. ACM 65, 3 (2022), 52–57. doi:10.1145/3474097

[34]

Nicole Novielli, Fabio Calefato, Filippo Lanubile, and Alexander Serebrenik. 2021. Assessment of off-the-shelf SE-specific sentiment analysis tools: An extended replication study. Empirical Software Engineering 26, 4 (2021), 77. doi:10.1007/s10664-021-09960-w

[35]

Phil Sidney Ostheimer, Mayank Kumar Nagda, Marius Kloft, and Sophie Fellenz. 2024. Text Style Transfer Evaluation Using Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, 15802–15822. https://aclanthology.org/202...

[36]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318

[37]

Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style Transfer Through Back-Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, Melbourne, Australia, 866–876. doi:10.18653/v1/p18-1080

[38]

Huilian Sophie Qiu, Bogdan Vasilescu, Christian Kästner, Carolyn Egelman, Ciera Jaspan, and Emerson Murphy-Hill. 2022. Detecting interpersonal conflict in issues and code review: cross pollinating open- and closed-source approaches. In Proceedings of the 2022 ACM/IEEE 44th International Conference on Software Engineering: Software Engineering in Society (Pittsburgh, Pennsylvania) (ICSE-SEIS ’22). Association for Computing Machinery, New York, NY, USA, 41–55. doi:10.1145/3510458.3513019

[40]

Md Shamimur Rahman, Zadia Codabux, and Chanchal K Roy. 2024. Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1632–1655. doi:10.1145/3660780

[41]

Naveen Raman, Minxuan Cao, Yulia Tsvetkov, Christian Kästner, and Bogdan Vasilescu. 2020. Stress and burnout in open source: Toward finding, understanding, and mitigating unhealthy interactions. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results. 57–60. doi:10.1145/3377816.3381732

[42]

Sarthak Roy, Ashish Harshvardhan, Animesh Mukherjee, and Punyajoy Saha. 2023. Probing LLMs for hate speech detection: strengths and vulnerabilities. In Findings of the Association for Computational Linguistics: EMNLP 2023. doi:10.18653/v1/2023.findings-emnlp.407

[43]

Elvis Saravia. 2022. Prompt Engineering Guide. https://github.com/dair-ai/Prompt-Engineering-Guide (12 2022)

[44]

Jaydeb Sarker, Sayma Sultana, Steven R Wilson, and Amiangshu Bosu. 2023. ToxiSpanSE: An explainable toxicity detection in code review comments. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 1–12. doi:10.1109/ESEM56168.2023.10304855

[45]

Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2020. A benchmark study of the contemporary toxicity detectors on software engineering interactions. In 2020 27th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 218–227. doi:10.1109/APSEC51365.2020.00030

[46]

Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2025. The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub. Proceedings of the ACM on Software Engineering 2, FSE (2025), 623–646. doi:10.1145/3715744

[47]

Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. 2023. Automated Identification of Toxic Code Reviews Using ToxiCR. ACM Transactions on Software Engineering and Methodology 32, 5 (July 2023), 1–32. doi:10.1145/3583562

[48]

Sulayman K Sowe, Ioannis Stamelos, and Lefteris Angelis. 2008. Understanding knowledge sharing activities in free/open source software projects: An empirical study. Journal of Systems and Software 81, 3 (2008), 431–446. doi:10.1016/j.jss.2007.03.086

[49]

Sayma Sultana, Jaydeb Sarker, Farzana Israt, Rajshakhar Paul, and Amiangshu Bosu. 2025. Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments. ACM Transactions on Software Engineering Methodology (TOSEM) 34 (2025). doi:10.1145/3757739

[50]

Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. 2023. Text Classification via Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023. 8990–9005. doi:10.18653/v1/2023.findings-emnlp.603

[51]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

[52]

Bianca Trinkenreich, Mariam Guizani, Igor Wiese, Tayana Conte, Marco Gerosa, Anita Sarma, and Igor Steinmacher. 2021. Pots of gold at the end of the rainbow: what is success for open source contributors? IEEE Transactions on Software Engineering 48, 10 (2021), 3940–3953. doi:10.1109/TSE.2021.3108032

[54]

Robert A Virzi. 1992. Refining the test phase of usability evaluation: how many subjects is enough? Human factors 34, 4 (1992), 457–468

[55]

Zhiqiang Wang, Yiran Pang, and Yanbin Lin. 2023. Large Language Models Are Zero-Shot Text Classifiers. doi:10.48550/ARXIV.2312.01044

[56]

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7 (09 2019), 625–641. doi:10.1162/tacl_a_00290

[57]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf

[59]

John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy,... doi:10.18653/v1/p19-1427

[60]

Frances Zlotnick. 2017. GitHub Open Source Survey 2017. http://opensourcesurvey.org/2017/. doi:10.5281/zenodo.806811

Received 2025-09-10; accepted 2026-03-24

Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE123. Publication date: July 2026