pith. machine review for the scientific record.

arxiv: 2605.07422 · v1 · submitted 2026-05-08 · 💻 cs.SE · cs.AI

Recognition: no theorem link

Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:45 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords prompt engineering · large language models · qualitative coding · psychological safety · software engineering · Cohen's kappa · empirical study · LLM-assisted qualitative analysis

The pith

Multi-shot prompting significantly improves agreement with human coders only for Claude Haiku in qualitative coding of psychological safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper runs a controlled test of zero-shot versus multi-shot prompting to see how well three large language models can replicate human qualitative coding of psychological safety themes in software engineering community data. It measures agreement using Cohen's kappa across ten runs per configuration and tracks both average performance and run-to-run stability. Results show a small but statistically significant gain for Claude Haiku with multi-shot examples, no reliable gain for the other two models, and a consistent pattern in which all three models over-predict Sharing Negative Feedback while under-predicting Expressing Concerns. Readers in software engineering research would care because manual qualitative analysis is slow and prone to individual differences, so understanding when LLMs can assist reliably could make such studies more scalable. The work supplies model-specific evidence rather than blanket claims about LLM coding.

Core claim

The study finds that switching from zero-shot to multi-shot closed coding raises Cohen's kappa agreement for Claude Haiku by 0.034 with statistical significance, but produces no reliable change for DeepSeek-Chat or Gemini 2.5 Flash. DeepSeek-Chat and Claude Haiku show the highest run-to-run stability while Gemini 2.5 Flash varies the most. Across all three models, predictions systematically over-represent the Sharing Negative Feedback category by factors up to 5.25 and under-represent Expressing Concerns.
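The bias ratios in the core claim are frequency ratios: how often a model predicts a category relative to how often human coders assigned it. A minimal sketch with invented labels and a hypothetical `bias_ratios` helper (not the paper's code or data):

```python
# Per-category bias ratio: LLM prediction frequency / human label frequency.
# Labels and counts below are illustrative only.
from collections import Counter

def bias_ratios(human_labels, llm_labels):
    """Ratio of LLM prediction count to human label count, per category."""
    h = Counter(human_labels)
    m = Counter(llm_labels)
    return {c: m.get(c, 0) / h[c] for c in h}

human = ["Expressing Concerns"] * 8 + ["Sharing Negative Feedback"] * 2
llm   = ["Expressing Concerns"] * 3 + ["Sharing Negative Feedback"] * 7

# Here the ratio is 3.5 for Sharing Negative Feedback (over-prediction)
# and 0.375 for Expressing Concerns (under-prediction).
print(bias_ratios(human, llm))
```

A ratio well above 1 marks over-prediction of that category; well below 1 marks under-prediction, mirroring the 5.25x figure reported in the paper.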

What carries the argument

A controlled empirical evaluation that compares zero-shot and multi-shot prompting strategies by measuring Cohen's kappa agreement with human ground truth labels, repeated over ten independent runs for each of three LLMs on the same psychological safety coding task.
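The agreement metric is small enough to sketch end to end. A minimal, illustrative Cohen's kappa over two label sequences, with toy labels that are not the paper's data; the paper additionally applies a Wilcoxon signed-rank test across the ten paired runs, which is omitted here:

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
from collections import Counter

def cohens_kappa(human, llm):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(human) == len(llm)
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    h_freq = Counter(human)
    m_freq = Counter(llm)
    # Chance agreement: sum over categories of p_human(c) * p_llm(c)
    expected = sum(h_freq[c] * m_freq.get(c, 0) for c in h_freq) / (n * n)
    return (observed - expected) / (1 - expected)

human = ["SNF", "EC", "SNF", "EC", "EC", "SNF", "EC", "EC"]
llm   = ["SNF", "SNF", "SNF", "EC", "SNF", "SNF", "EC", "EC"]
print(round(cohens_kappa(human, llm), 3))  # → 0.529
```

The study's design repeats this computation ten times per model and prompt configuration, then compares the paired zero-shot and multi-shot kappa distributions.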

If this is right

  • Researchers using Claude Haiku for similar qualitative coding tasks can expect improved agreement by switching to multi-shot prompts.
  • Outputs from Gemini 2.5 Flash require extra scrutiny because of its higher variance across repeated runs.
  • All three models exhibit consistent bias toward over-predicting Sharing Negative Feedback, so post hoc adjustments may be needed for accurate category distributions.
  • Prompt engineering for LLM-assisted qualitative work in software engineering should be tested per model rather than assumed to work uniformly.
  • Repeating the coding task ten times and averaging provides a practical check on stability before relying on an LLM for a full study.
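The stability check in the last bullet takes only a few lines; the kappa values below are invented for illustration, and `stability_report` is a hypothetical helper, not from the paper:

```python
# Repeat the coding task, collect one kappa per run, then summarize
# mean agreement and run-to-run spread before trusting the LLM coder.
from statistics import mean, stdev

def stability_report(kappas):
    """Mean and sample SD of repeated-run agreement scores."""
    return {"mean": round(mean(kappas), 3), "sd": round(stdev(kappas), 3)}

runs = [0.41, 0.44, 0.39, 0.43, 0.42, 0.40, 0.44, 0.41, 0.43, 0.42]
report = stability_report(runs)
print(report)  # → {'mean': 0.419, 'sd': 0.017}
```

A high SD relative to the mean, as the paper reports for Gemini 2.5 Flash, signals that a single run is not representative.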

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model-specific prompt effects observed here likely appear in other qualitative coding domains such as code review sentiment or team dynamics analysis.
  • Hybrid workflows that let LLMs draft initial codes and humans correct the known bias categories could cut analysis time while preserving validity.
  • Future experiments could examine whether larger context windows or different example selection methods reduce the stability gap seen with Gemini.
  • These results suggest that as LLMs evolve, periodic re-testing of prompting strategies will be required rather than one-time guideline adoption.

Load-bearing premise

That the labels produced by human coders represent a stable and objective ground truth suitable for direct comparison with LLM outputs across the chosen dataset and prompts.

What would settle it

A replication using an independently coded dataset of software engineering community texts with different category distributions, where the multi-shot improvement for Claude Haiku disappears or the bias patterns reverse.

Figures

Figures reproduced from arXiv: 2605.07422 by Beatriz Santana, Clelio Xavier, Glauco Carneiro, Jose Amancio, Julio Leite, Manoel Mendonca, Moaath Alshaikh, Ricardo Vieira, Savio Freire, Tasneem Alshaher.

Figure 1. LLM-based exploratory workflow.
Figure 2. κ distribution across ten runs for all models under both prompt configurations.
Figure 3. Per-class F1 scores (majority vote, 10 runs).
Figure 5. "Expressing Concerns" prediction bias ratio.
Original abstract

Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a controlled empirical study evaluating zero-shot versus multi-shot prompting strategies for three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) in qualitative coding of psychological safety themes from software engineering community data. Cohen's kappa is used as the primary metric of agreement with human labels across ten independent runs per configuration, with additional reporting of intra-model stability (SD of kappa) and category-specific prediction biases (e.g., over-prediction of 'Sharing Negative Feedback').

Significance. If the results hold after addressing ground-truth issues, the work supplies model-specific empirical guidance on prompt engineering for LLM-assisted qualitative analysis in software engineering, an area where such support is increasingly adopted. The controlled design, use of ten runs per condition, and application of Wilcoxon tests provide a stronger empirical foundation than single-run anecdotal comparisons.

major comments (2)
  1. [Methods] Methods section: Human inter-rater reliability (pairwise or multi-rater Cohen's kappa, number of coders, and disagreement resolution procedure) is not reported. Because the central claims rest on LLM-human kappa values and their deltas (e.g., +0.034 for Claude Haiku), the absence of a human baseline prevents interpretation of whether observed agreements exceed, match, or fall below typical human variability in this subjective coding task. This is load-bearing for the prompting-effect and bias conclusions.
  2. [Results] Results section: The claim of systematic LLM bias in over-predicting 'Sharing Negative Feedback' (bias ratios up to 5.25x) and under-predicting 'Expressing Concerns' lacks supporting human confusion matrices or label-distribution statistics. Without these, the observed patterns cannot be distinguished from possible mirroring of human labeling tendencies.
minor comments (2)
  1. [Abstract] Abstract: 'Delta kappa' should be written as Δκ or 'κ difference' for notational consistency with statistical reporting standards.
  2. [Abstract] Abstract and Methods: Dataset size (number of posts or excerpts coded), exact prompt templates, and the precise definition of the ten-run sampling procedure should be stated explicitly (or clearly referenced to an appendix) to support replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough and constructive review. We appreciate the emphasis on methodological transparency and have revised the manuscript to address the concerns about human inter-rater reliability and supporting statistics for the bias analysis.

Point-by-point responses
  1. Referee: [Methods] Methods section: Human inter-rater reliability (pairwise or multi-rater Cohen's kappa, number of coders, and disagreement resolution procedure) is not reported. Because the central claims rest on LLM-human kappa values and their deltas (e.g., +0.034 for Claude Haiku), the absence of a human baseline prevents interpretation of whether observed agreements exceed, match, or fall below typical human variability in this subjective coding task. This is load-bearing for the prompting-effect and bias conclusions.

    Authors: We agree that a human inter-rater reliability baseline is necessary to properly interpret the LLM-human agreement values and their changes under different prompting strategies. In the revised manuscript we have expanded the Methods section to report the number of human coders, the pairwise Cohen's kappa between them, and the procedure used to resolve disagreements. This addition supplies the missing context for evaluating the observed deltas and bias patterns against typical human variability. revision: yes

  2. Referee: [Results] Results section: The claim of systematic LLM bias in over-predicting 'Sharing Negative Feedback' (bias ratios up to 5.25x) and under-predicting 'Expressing Concerns' lacks supporting human confusion matrices or label-distribution statistics. Without these, the observed patterns cannot be distinguished from possible mirroring of human labeling tendencies.

    Authors: We concur that human label-distribution statistics and confusion matrices are required to distinguish LLM-specific biases from simple reproduction of the human label distribution. We have revised the Results section to include the human label frequencies across categories together with confusion matrices that compare human and LLM predictions. These additions show that the over-prediction of 'Sharing Negative Feedback' exceeds what would be expected from mirroring the human distribution alone. revision: yes
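The check described in this response is mechanical: tabulate (human, LLM) label pairs and compare marginal distributions. A hedged sketch with invented labels (`EC` = Expressing Concerns, `SNF` = Sharing Negative Feedback), not the paper's data:

```python
# Confusion matrix plus marginal label counts: over-prediction that
# exceeds the human label distribution is bias, not mirroring.
from collections import Counter, defaultdict

def confusion_matrix(human, llm):
    """Counts of (human label, llm label) pairs."""
    cm = defaultdict(int)
    for h, m in zip(human, llm):
        cm[(h, m)] += 1
    return dict(cm)

human = ["EC", "EC", "EC", "SNF", "EC", "SNF"]
llm   = ["SNF", "EC", "SNF", "SNF", "EC", "SNF"]

# Human marginal: 4 EC / 2 SNF; LLM marginal: 2 EC / 4 SNF.
# The (EC -> SNF) cell shows where the excess SNF predictions come from.
print(confusion_matrix(human, llm))
print(Counter(human), Counter(llm))
```

If the LLM's SNF marginal merely matched the human SNF marginal, the over-prediction claim would collapse; the off-diagonal cells make the distinction explicit.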

Circularity Check

0 steps flagged

No circularity in empirical comparison study

Full rationale

The paper is a controlled empirical evaluation that directly measures Cohen's kappa agreement between LLM outputs and human labels across zero-shot and multi-shot prompting for three models, with ten runs per configuration and basic statistical tests (Wilcoxon, SD). No equations, derivations, fitted parameters, or self-referential definitions appear; all reported deltas, variances, and bias ratios are computed from the observed data rather than constructed from author choices. The study contains no load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the central claims to inputs by definition, rendering the results self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that human labels are a stable gold standard and that Cohen's kappa plus Wilcoxon tests are appropriate for this setting; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Human qualitative coding provides a reliable ground truth for evaluating LLM performance.
    Agreement with human coders is used as the primary success metric throughout the study.

pith-pipeline@v0.9.0 · 5602 in / 1433 out tokens · 45517 ms · 2026-05-11T01:45:12.353156+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, et al. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023

  2. [2]

Muneera Bano, Rashina Hoda, Didar Zowghi, and Christoph Treude. 2024. Large language models for qualitative research in software engineering: exploring opportunities and challenges. Automated Software Engineering 31, 1 (2024), 8

  3. [3]

Cauã Ferreira Barros, Bruna Borges Azevedo, Valdemar Vicente Graciano Neto, Mohamad Kassab, Marcos Kalinowski, Hugo Alexandre D. Do Nascimento, and Michelle C.G.S.P. Bandeira. 2025. Large Language Model for Qualitative Research: A Systematic Mapping Study. In Proc. of the IEEE/ACM Int. Workshop on Methodological Issues with Empirical Studies in Software ...

  4. [4]

Banghao Chen, Zhaofeng Zhang, Nicolas Langréné, and Shengxin Zhu. 2025. Unleashing the potential of prompt engineering for large language models. Patterns 6, 6 (2025), 101260

  5. [5]

Amy Edmondson. 1999. Psychological safety and learning behavior in work teams. Administrative Science Quarterly 44, 2 (1999), 350–383

  6. [6]

Jie Gao, Yuchen Guo, Gionnieve Lim, Tianqin Zhang, Zheng Zhang, Toby Jia-Jun Li, and Simon Tangi Perrault. 2024. CollabCoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–29

  7. [7]

Matheus De Morais Leça, Lucas Valença, Reydne Santos, and Ronnie De Souza Santos. 2025. Applications and implications of large language models in qualitative analysis: A new frontier for empirical software engineering. In Proc. of the IEEE/ACM Int. Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE). IEEE, 36–43

  8. [8]

Per Lenberg, Robert Feldt, Lucas Gren, Lars Göran Wallgren Tengberg, Inga Tidefors, and Daniel Graziotin. 2024. Qualitative software engineering research: Reflections and guidelines. Journal of Software: Evolution and Process 36, 6 (2024), e2607

  9. [9]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. Comput. Surveys 55, 9, Article 195 (2023), 35 pages

  10. [10]

Beatriz Silva De Santana, Sávio Freire, Leandro Cruz, Lidivânio Monte, Manoel Mendonca, and José Amancio Macedo Santos. 2023. Exploring psychological safety in software engineering: Insights from stack exchange. In Proceedings of the XXXVII Brazilian Symposium on Software Engineering (SBES). 503–513

  11. [11]

Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. 2024. The prompt report: A systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608 (2024)

  12. [12]

Carolyn B. Seaman, Rashina Hoda, and Robert Feldt. 2025. Qualitative Research Methods in Software Engineering: Past, Present, and Future. IEEE Transactions on Software Engineering 51, 3 (2025), 783–788

  13. [13]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837

  14. [14]

Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP). Article 5