pith. sign in

arxiv: 2604.14408 · v1 · submitted 2026-04-15 · 💻 cs.SE

ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering

Pith reviewed 2026-05-10 12:38 UTC · model grok-4.3

classification 💻 cs.SE
keywords toxicity detectioncode reviewslarge language modelsbrowser extensionsoftware engineeringconstructive communicationGitHub pull requests
0
0 comments X

The pith

A browser extension detects toxic code review comments and generates constructive rewrites in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ToxiShield, a GitHub-integrated tool that identifies harmful language in pull request discussions, explains the issues, and suggests revised versions that preserve meaning while sounding more supportive. It builds this from separate models for detection, categorization, and reframing, then checks the results with both performance numbers and a small user study. A reader would care because negative exchanges in code reviews can discourage contributors and slow collaboration in open source projects. If the approach holds, everyday development environments could start actively guiding people toward clearer, less abrasive feedback.

Core claim

ToxiShield provides a three-part system for GitHub pull requests: a BERT model that flags toxic text, an LLM that assigns fine-grained toxicity categories with explanations, and a fine-tuned Llama model that rewrites toxic comments into constructive alternatives. The authors select the best-performing versions of each component after training on thousands of real code review samples and confirm the overall tool through automated accuracy metrics plus a Technology Acceptance Model survey with ten participants.

What carries the argument

The Reframer module, which takes toxic code review text and produces a revised version that maintains technical content while shifting to a non-toxic, fluent style through fine-tuning on labeled examples.

If this is right

  • Developers receive immediate, in-context suggestions that let them revise comments before posting, lowering the chance of escalating conflicts.
  • The categorization step supplies concrete reasons for flagged text, giving users repeated practice in recognizing and avoiding toxic patterns.
  • Wider use in open source repositories could increase retention of contributors who might otherwise leave after negative review experiences.
  • The browser-extension format shows a practical path for adding similar support features directly into existing collaboration platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same detection and rewriting pipeline could be adapted to other developer channels such as commit messages or issue threads without major redesign.
  • Over time, repeated exposure to reframed examples might gradually change default communication habits across entire teams or projects.
  • Pairing the reframer with static analysis tools could add a check that rewritten comments still accurately describe the technical point under discussion.

Load-bearing premise

The models will keep identifying toxicity correctly and producing useful rewrites when they encounter new comments from projects and developers outside the original training data.

What would settle it

Apply the full ToxiShield pipeline to several hundred recent GitHub pull request comments never seen during training and measure whether human reviewers agree with its toxicity flags and rephrasings at rates substantially above random chance.

Figures

Figures reproduced from arXiv: 2604.14408 by Amiangshu Bosu, Anindya Iqbal, Jaydeb Sarker, Md Awsaf Alam Anindya, Showvik Biswas.

Figure 1
Figure 1. Figure 1: Motivational Workflow of ToxiShield research indicates that communication within OSS communities frequently becomes unprofessional and toxic [10, 21, 28, 39, 43–45]. This hostility produces severe consequences. It creates significant barriers for new partici￾pants [39] and causes active contributors to leave the project [43]. Furthermore, toxic interactions disproportionately affect minority and junior dev… view at source ↗
Figure 2
Figure 2. Figure 2: Workflow of the Toxicity Filter Module where 15 M PR comments are selected from [44]. often prioritize offline classification accuracy over the latency constraints required for seamless integration into real-time developer workflows. For instance, the current state-of-the-art tool, ToxiCR [45], achieves strong performance metrics (88.9% F1-score, 95.8% accuracy); however, recent studies indicate that its e… view at source ↗
Figure 3
Figure 3. Figure 3: Communication Coach: Iterative Prompt Evolution [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Confusion Matrix for Communication Coach, where SD represents Self Deprecation, Identity Attack [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The Reframer: A teacher model is used to create a parallel dataset, which is used to train a student detoxification model with iterative prompting. model reliably detects general toxicity and generates logical rationales, it occasionally struggles to resolve the blurred boundaries between closely related interpersonal toxicities. Key Finding 2: Claude 3.5 Sonnet achieves best performance on multiclass toxi… view at source ↗
Figure 6
Figure 6. Figure 6: Browser Extension to integrate ToxiShield with GitHub-based pull request reviews [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
read the original abstract

Toxic interactions during code reviews can undermine teamwork and hinder productivity in software engineering (SE) teams. While prior studies explore toxicity detection and empirical investigation, they lack real-time detoxification tools to support the SE community. To address this gap, we present ToxiShield, a browser extension for GitHub pull requests that is built using three modules: i) Toxicity Filter -- to identify whether a text is toxic, ii) Communication coach -- to facilitate just-in-time fine-grained toxicity categorization with explanations, and iii) The Reframer -- that generates a revised, constructive alternative of a toxic text. For each module, we trained and evaluated multiple deep learning and Large Language Models (LLMs) to identify the best choice. A BERT-based binary detection model, trained on 38,761 code review samples, achieves 98% accuracy and an F1-score of 97% and is the selected one for the Toxicity Filter module. For the Communication Coach, prompt-tuned Claude 3.5 Sonnet achieved the best performance with 39% MCC and 42% F1 in multiclass toxicity classification with detailed reasoning. For Reframer, we evaluated five LLMs using a fine-tuning strategy on a dataset of 10,120 code review comments. The fine-tuned Llama 3.2 model achieves 95.27% style transfer accuracy, 97.03% fluency, 67.07% content preservation, and an 84% J-score. We further validated ToxiShield through a human evaluation using the Technology Acceptance Model with 10 participants, confirming its perceived usefulness and ease of adoption. ToxiShield sets a benchmark for advancing constructive communication in software engineering, driving inclusivity and healthier collaboration in open-source communities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ToxiShield, a browser extension for GitHub pull requests comprising three modules: a BERT-based Toxicity Filter (98% accuracy, 97% F1 on 38,761 code review samples), a prompt-tuned Claude 3.5 Sonnet Communication Coach (39% MCC, 42% F1 for multiclass categorization), and a fine-tuned Llama 3.2 Reframer (95.27% style transfer accuracy, 84% J-score on 10,120 samples). It also reports a Technology Acceptance Model human evaluation with 10 participants confirming perceived usefulness and ease of use. The work positions the tool as a benchmark for real-time detoxification to promote inclusive developer communication.

Significance. If the reported performance generalizes, ToxiShield offers a practical, integrated system for toxicity detection, explanation, and constructive reframing directly in code review workflows, addressing a gap in SE tooling. The concrete dataset sizes, multi-model evaluation, and small-scale TAM validation provide a starting point for deployment-oriented research. The combination of filter + coach + reframer is a strength, but the absence of out-of-distribution testing on live GitHub data limits claims about real-world effectiveness.

major comments (2)
  1. [Abstract] Abstract and Evaluation sections: All performance figures (BERT 98% accuracy/97% F1, Claude 39% MCC, Llama 95.27% style-transfer/84% J-score) are obtained from internal train/test splits of the curated corpora. No baseline comparisons (e.g., against prior toxicity detectors or simpler classifiers), statistical significance tests, confidence intervals, or details on annotation quality and data splits are provided, making it impossible to judge whether the selected models are meaningfully superior or reliable.
  2. [Evaluation] Deployment and Evaluation sections: The central claim that ToxiShield enables effective real-time filtering, categorization, and reframing inside actual GitHub pull requests is not supported by any held-out evaluation on heterogeneous, temporally current GitHub threads (different projects, languages, reviewer styles, or temporal drift). Toxicity labels and constructive rewrites are context-sensitive; performance on the paper's internal distributions does not establish behavior on the deployment distribution targeted by the browser extension.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'sets a benchmark' is not justified by the reported experiments; consider rephrasing to 'provides an initial integrated prototype' or similar.
  2. [Human Evaluation] The human evaluation description (10 participants, TAM) lacks details on participant recruitment, task instructions, and statistical analysis of the survey responses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects of evaluation rigor and real-world applicability. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation sections: All performance figures (BERT 98% accuracy/97% F1, Claude 39% MCC, Llama 95.27% style-transfer/84% J-score) are obtained from internal train/test splits of the curated corpora. No baseline comparisons (e.g., against prior toxicity detectors or simpler classifiers), statistical significance tests, confidence intervals, or details on annotation quality and data splits are provided, making it impossible to judge whether the selected models are meaningfully superior or reliable.

    Authors: We agree that the current evaluation would be strengthened by explicit baseline comparisons, statistical significance testing, confidence intervals, and fuller details on annotation processes and data splits. In the revised manuscript we will add (1) comparisons against simpler classifiers (e.g., logistic regression on TF-IDF) and representative prior toxicity detectors, (2) McNemar or paired t-tests with p-values, (3) 95% confidence intervals via bootstrapping, and (4) expanded description of the annotation protocol, inter-annotator agreement, and train/validation/test split ratios. These additions will allow readers to better assess model reliability and superiority. revision: yes

  2. Referee: [Evaluation] Deployment and Evaluation sections: The central claim that ToxiShield enables effective real-time filtering, categorization, and reframing inside actual GitHub pull requests is not supported by any held-out evaluation on heterogeneous, temporally current GitHub threads (different projects, languages, reviewer styles, or temporal drift). Toxicity labels and constructive rewrites are context-sensitive; performance on the paper's internal distributions does not establish behavior on the deployment distribution targeted by the browser extension.

    Authors: We acknowledge that the reported metrics derive from curated, internal train/test splits rather than live, temporally current GitHub threads spanning diverse projects and reviewer styles. The datasets themselves consist of real code-review comments, and the 10-participant TAM study provides evidence of perceived usefulness in a realistic usage scenario. In revision we will (1) add an explicit limitations subsection discussing the gap between curated and live distributions, (2) report any available qualitative observations from the TAM sessions that touched on context sensitivity, and (3) outline concrete plans for future deployment studies. We cannot, however, retroactively supply new held-out live-GitHub results within the current revision cycle. revision: partial

Circularity Check

0 steps flagged

No circularity: standard empirical ML training and evaluation

full rationale

The paper presents an applied system built from three independently trained models (BERT binary classifier on 38,761 samples, prompt-tuned Claude for multiclass, fine-tuned Llama 3.2 on 10,120 samples) whose performance numbers are obtained via conventional train/test splits and a 10-participant TAM study. No equations, derivations, uniqueness theorems, or self-citations are invoked to justify the central claims; each reported metric follows directly from the training procedure and external human ratings without reducing to a definitional loop or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard supervised learning assumptions and the premise that toxicity can be reliably labeled and reframed without loss of technical meaning; no new physical or mathematical entities are introduced.

pith-pipeline@v0.9.0 · 5634 in / 1113 out tokens · 28626 ms · 2026-05-10T12:38:03.709235+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 2 internal anchors

  1. [1]

    2026.ToxiShield: Replication Package

    Md Awsaf Alam Anindya, Showvik Biswas, Anindya Iqbal, Jaydeb Sarker, and Amiangshu Bosu. 2026.ToxiShield: Replication Package. https://github.com/WSU-SEAL/ToxiShield

  2. [2]

    Sarthak Bharadwaj, Fabio Santos, and Bianca Trinkenreich. 2025. The Shifting Sands of Toxicity: The Evolving Nature of Interpersonal Challenges in Open Source. In2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–11. doi:10.1109/ESEM64174.2025.00016

  3. [3]

    Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology.Qualitative research in psychology3, 2 (2006), 77–101

  4. [4]

    Kevin Daniel André Carillo, Josianne Marsan, and Bogdan Negoita. 2016. Towards developing a theory of toxicity in the context of free/open source software & peer production communities.SIGOPEN 2016(2016)

  5. [5]

    Ofer Dekel and Ohad Shamir. 2010. Multiclass-multilabel classification with more classes than examples. InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 137–144

  6. [6]

    Tanni Dev, Sayma Sultana, and Amiangshu Bosu. 2025. Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models. In2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)(Honolulu, Hawai, USA, USA). 1–12. doi:10.1109/ESEM64174.2025.00059

  7. [7]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 4171–4186. doi:10.18653...

  8. [8]

    Carolyn D Egelman, Emerson Murphy-Hill, Elizabeth Kammer, Margaret Morrow Hodges, Collin Green, Ciera Jaspan, and James Lin. 2020. Predicting developers’ negative feelings about code review. InProceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 174–185. doi:10.1145/3377811.3380414

  9. [9]

    Ramtin Ehsani, Mia Mohammad Imran, Robert Zita, Kostadin Damevski, and Preetha Chatterjee. 2024. Incivility in open source projects: A comprehensive annotated dataset of locked github issue threads. InProceedings of the 21st International Conference on Mining Software Repositories. 515–519. doi:10.1145/3643991.3644887

  10. [10]

    Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee. 2023. Exploring moral principles exhibited in oss: A case study on github heated issues. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2092–2096

  11. [11]

    Ramtin Ehsani, Rezvaneh Rezapour, and Preetha Chatterjee. 2025. Analyzing Toxicity in Open Source Software Communications Using Psycholinguistics and Moral Foundations Theory. (2025), 1–8. doi:10.1109/NLBSE66842.2025. 00006

  12. [12]

    Laura Faulkner. 2003. Beyond the five-user assumption: Benefits of increased sample sizes in usability testing.Behavior Research Methods, Instruments, & Computers35, 3 (2003), 379–383. doi:10.3758/BF03195514

  13. [13]

    Shut the f**k up

    Isabella Ferreira, Jinghui Cheng, and Bram Adams. 2021. The" shut the f** k up" phenomenon: Characterizing incivility in open source code review discussions.Proceedings of the ACM on Human-Computer Interaction5, CSCW2 (2021), 1–35. doi:10.1145/3479497

  14. [14]

    Isabella Ferreira, Ahlaam Rafiq, and Jinghui Cheng. 2024. Incivility detection in open source code review and issue discussions.Journal of Systems and Software209 (2024), 111935. doi:10.1016/j.jss.2023.111935

  15. [15]

    Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. InProceedings of the AAAI conference on artificial intelligence, Vol. 32. doi:10.1609/aaai.v32i1.11330

  16. [16]

    Shai Gretz, Alon Halfon, Ilya Shnayderman, Orith Toledo-Ronen, Artem Spector, Lena Dankin, Yannis Katsis, Ofir Arviv, Yoav Katz, Noam Slonim, and Liat Ein-Dor. 2023. Zero-shot Topical Text Classification with LLMs - an Experimental Study. InFindings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 96...

  17. [17]

    Sanuri Dananja Gunawardena, Peter Devine, Isabelle Beaumont, Lola Piper Garden, Emerson Murphy-Hill, and Kelly Blincoe. 2022. Destructive criticism in software code review impacts inclusion.Proceedings of the ACM on Human-Computer Interaction6, CSCW2 (2022), 1–29. doi:10.1145/3555183

  18. [18]

    Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, and Hongxin Hu. 2023. An Investigation of Large Language Models for Real-World Hate Speech Detection. In2023 International Conference on Machine Learning and Applications (ICMLA). 1568–1573. doi:10.1109/ICMLA58977.2023.00237

  19. [19]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531(2015)

  20. [20]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

  21. [21]

    InInternational Conference on Learning Representations

    LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE123. Publication date: July 2026. FSE123:22 Anindyaet al

  22. [22]

    Mia Mohammad Imran, Robert Zita, Rahat Rizvi Rahman, Preetha Chatterjee, and Kostadin Damevski. 2026. Toxicity Ahead: Forecasting Conversational Derailment on GitHub. InProceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26)(Rio de Janeiro, Brazil). ACM, Rio de Janeiro, Brazil. doi:10.1145/3744916. 3787839

  23. [23]

    Satyanarayana Chowdary Kadiyala, Jaydeb Sarker, and Bianca Trinkenreich. 2026. How should self-deprecation comments be classified? A toxicity analysis study on Zephyr. InProceedings of the 5th ACM/IEEE International Workshop on NL-based Software Engineering (NLBSE). doi:10.1145/3786164.3788447

  24. [24]

    Anjali Khurana, Hariharan Subramonyam, and Parmit K Chilana. 2024. Why and when llm-based assistants can go wrong: Investigating the effectiveness of prompt-based interactions for software help-seeking. InProceedings of the 29th International Conference on Intelligent User Interfaces. 288–303. doi:10.1145/3640543.3645200

  25. [25]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners.Advances in neural information processing systems35 (2022), 22199–22213

  26. [26]

    Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing Toxic Content Classification for a Diversity of Perspectives. InSeventeenth Symposium on Usable Privacy and Security (SOUPS 2021). USENIX Association, 299–318. https://www.usenix.org/ conference/soups2021/prese...

  27. [27]

    Richard Landis and Gary G

    J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data.Biometrics 33, 1 (March 1977), 159. doi:10.2307/2529310

  28. [28]

    2005.The Technology Acceptance Model

    Qingxiong Ma and Liping Liu. 2005.The Technology Acceptance Model. doi:10.4018/9781591404743.ch006.ch000

  29. [29]

    Hassan, and Xiaohu Yang

    Courtney Miller, Sophie Cohen, Daniel Klug, Bogdan Vasilescu, and Christian Kästner. 2022. " Did you miss my comment or what?" understanding toxicity in open source discussions. InProceedings of the 44th International Conference on Software Engineering. 710–722. doi:10.1145/3510003.3510111

  30. [30]

    Shyamal Mishra and Preetha Chatterjee. 2024. Exploring ChatGPT for Toxicity Detection in GitHub(ICSE-NIER’24). Association for Computing Machinery, New York, NY, USA, 6–10. doi:10.1145/3639476.3639777

  31. [31]

    Subhabrata Mukherjee, Ahmed Hassan Awadallah, and Jianfeng Gao. 2021. XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation. doi:10.48550/ARXIV.2106.04563

  32. [32]

    Emerson Murphy-Hill, Jill Dicker, Delphine Carlson, Marian Harbach, Ambar Murillo, and Tao Zhou. 2024. Did Gerrit’s Respectful Code Review Reminders Reduce Comment Toxicity? InEquity, Diversity, and Inclusion in Software Engineering: Best Practices and Insights. Apress Berkeley, CA, 309–321. doi:10.1007/978-1-4842-9651-6_18

  33. [33]

    Emerson Murphy-Hill, Ciera Jaspan, Carolyn Egelman, and Lan Cheng. 2022. The pushback effects of race, ethnicity, gender, and age in code review.Commun. ACM65, 3 (2022), 52–57. doi:10.1145/3474097

  34. [34]

    Nicole Novielli, Fabio Calefato, Filippo Lanubile, and Alexander Serebrenik. 2021. Assessment of off-the-shelf SE- specific sentiment analysis tools: An extended replication study.Empirical Software Engineering26, 4 (2021), 77. doi:10.1007/s10664-021-09960-w

  35. [35]

    Phil Sidney Ostheimer, Mayank Kumar Nagda, Marius Kloft, and Sophie Fellenz. 2024. Text Style Transfer Evaluation Using Large Language Models. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, Torino, Italia, 15802–15822. https:// aclanthology.org/202...

  36. [36]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318

  37. [37]

    Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style Transfer Through Back- Translation. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, Melbourne, Australia, 866–876. doi:...

  38. [38]

    Huilian Sophie Qiu, Bogdan Vasilescu, Christian Kästner, Carolyn Egelman, Ciera Jaspan, and Emerson Murphy-Hill

  39. [39]

    Gerosa, and Igor Steinmacher

    Detecting interpersonal conflict in issues and code review: cross pollinating open- and closed-source approaches. InProceedings of the 2022 ACM/IEEE 44th International Conference on Software Engineering: Software Engineering in Society(Pittsburgh, Pennsylvania)(ICSE-SEIS ’22). Association for Computing Machinery, New York, NY, USA, 41–55. doi:10.1145/3510...

  40. [40]

    Md Shamimur Rahman, Zadia Codabux, and Chanchal K Roy. 2024. Do Words Have Power? Understanding and Fostering Civility in Code Review Discussion.Proceedings of the ACM on Software Engineering1, FSE (2024), 1632–1655. doi:10.1145/3660780

  41. [41]

    Naveen Raman, Minxuan Cao, Yulia Tsvetkov, Christian Kästner, and Bogdan Vasilescu. 2020. Stress and burnout in open source: Toward finding, understanding, and mitigating unhealthy interactions. InProceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results. 57–60. doi:10.1145/3377816.3381732

  42. [42]

    Sarthak Roy, Ashish Harshvardhan, Animesh Mukherjee, and Punyajoy Saha. 2023. Probing LLMs for hate speech detection: strengths and vulnerabilities. InFindings of the Association for Computational Linguistics: EMNLP 2023. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE123. Publication date: July 2026. ToxiShield: Promoting Inclusive Developer Communic...

  43. [43]

    Elvis Saravia. 2022. Prompt Engineering Guide.https://github.com/dair-ai/Prompt-Engineering-Guide(12 2022)

  44. [44]

    Jaydeb Sarker, Sayma Sultana, Steven R Wilson, and Amiangshu Bosu. 2023. ToxiSpanSE: An explainable toxicity detection in code review comments. In2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 1–12. doi:10.1109/ESEM56168.2023.10304855

  45. [45]

    Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2020. A benchmark study of the contemporary toxicity detectors on software engineering interactions. In2020 27th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 218–227. doi:10.1109/APSEC51365.2020.00030

  46. [46]

    Jaydeb Sarker, Asif Kamal Turzo, and Amiangshu Bosu. 2025. The Landscape of Toxicity: An Empirical Investigation of Toxicity on GitHub.Proceedings of the ACM on Software Engineering2, FSE (2025), 623–646. doi:10.1145/3715744

  47. [47]

    Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, and Amiangshu Bosu. 2023. Automated Identification of Toxic Code Reviews Using ToxiCR.ACM Transactions on Software Engineering and Methodology32, 5 (July 2023), 1–32. doi:10. 1145/3583562

  48. [48]

    Sulayman K Sowe, Ioannis Stamelos, and Lefteris Angelis. 2008. Understanding knowledge sharing activities in free/open source software projects: An empirical study.Journal of Systems and Software81, 3 (2008), 431–446. doi:10.1016/j.jss.2007.03.086

  49. [49]

    Sayma Sultana, Jaydeb Sarker, Farzana Israt, Rajshakhar Paul, and Amiangshu Bosu. 2025. Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments.ACM Transactions on Software Engineering Methodology (TOSEM)34 (2025). doi:10.1145/3757739

  50. [50]

    Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. 2023. Text Classification via Large Language Models. InFindings of the Association for Computational Linguistics: EMNLP 2023. 8990–9005. doi:10.18653/v1/2023.findings-emnlp.603

  51. [51]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971(2023)

  52. [52]

    Bianca Trinkenreich, Mariam Guizani, Igor Wiese, Tayana Conte, Marco Gerosa, Anita Sarma, and Igor Steinmacher

  53. [53]

    doi:10.1109/TSE.2021.3108032

    Pots of gold at the end of the rainbow: what is success for open source contributors?IEEE Transactions on Software Engineering48, 10 (2021), 3940–3953. doi:10.1109/TSE.2021.3108032

  54. [54]

    Robert A Virzi. 1992. Refining the test phase of usability evaluation: how many subjects is enough?Human factors34, 4 (1992), 457–468

  55. [55]

    Zhiqiang Wang, Yiran Pang, and Yanbin Lin. 2023. Large Language Models Are Zero-Shot Text Classifiers. doi:10. 48550/ARXIV.2312.01044

  56. [56]

    Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments.Transactions of the Association for Computational Linguistics7 (09 2019), 625–641. doi:10.1162/tacl_a_00290

  57. [57]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou

  58. [58]

    InAdvances in Neural Information Processing Systems, Vol

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems, Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/ 2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf

  59. [59]

    John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy,...

  60. [60]

    Frances Zlotnick. 2017. GitHub Open Source Survey 2017. http://opensourcesurvey.org/2017/. doi:10.5281/zenodo. 806811 Received 2025-09-10; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE123. Publication date: July 2026