Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study
Pith reviewed 2026-05-11 01:45 UTC · model grok-4.3
The pith
Multi-shot prompting significantly improves agreement with human coders only for Claude Haiku in qualitative coding of psychological safety.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study finds that switching from zero-shot to multi-shot closed coding raises Cohen's kappa agreement for Claude Haiku by 0.034 with statistical significance, but produces no reliable change for DeepSeek-Chat or Gemini 2.5 Flash. DeepSeek-Chat and Claude Haiku show the highest run-to-run stability while Gemini 2.5 Flash varies the most. Across all three models, predictions systematically over-represent the Sharing Negative Feedback category by factors up to 5.25 and under-represent Expressing Concerns.
What carries the argument
A controlled empirical evaluation that compares zero-shot and multi-shot prompting strategies by measuring Cohen's kappa agreement with human ground truth labels, repeated over ten independent runs for each of three LLMs on the same psychological safety coding task.
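A minimal sketch of that evaluation loop, under assumed data shapes rather than the paper's actual code: `human_labels` stands for the ground-truth coding, and `zero_shot_runs` / `multi_shot_runs` are illustrative names for the ten per-run LLM label lists per strategy.

```python
# Sketch of the study's design, assuming parallel label lists:
# `human_labels` is the human coding; each element of the *_runs
# lists holds one independent run's LLM labels for the same items.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import wilcoxon

def kappas_per_run(human_labels, llm_runs):
    """Cohen's kappa between the human coding and each LLM run."""
    return [cohen_kappa_score(human_labels, run) for run in llm_runs]

def compare_strategies(human_labels, zero_shot_runs, multi_shot_runs):
    """Mean kappa difference plus a paired Wilcoxon signed-rank test."""
    k0 = kappas_per_run(human_labels, zero_shot_runs)
    k1 = kappas_per_run(human_labels, multi_shot_runs)
    delta_kappa = sum(k1) / len(k1) - sum(k0) / len(k0)
    _, p_value = wilcoxon(k1, k0)  # ten paired runs per strategy
    return delta_kappa, p_value
```

On this reading, the abstract's Delta kappa = +0.034 with Wilcoxon p = 0.004 for Claude Haiku is the output of exactly such a paired comparison.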
If this is right
- Researchers using Claude Haiku for similar qualitative coding tasks can expect improved agreement by switching to multi-shot prompts.
- Outputs from Gemini 2.5 Flash require extra scrutiny because of its higher variance across repeated runs.
- All three models exhibit consistent bias toward over-predicting Sharing Negative Feedback, so post hoc adjustments may be needed for accurate category distributions.
- Prompt engineering for LLM-assisted qualitative work in software engineering should be tested per model rather than assumed to work uniformly.
- Repeating the coding task ten times and averaging provides a practical check on stability before relying on an LLM for a full study (a minimal sketch follows this list).
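A hedged sketch of that pre-study stability check, assuming a hypothetical `run_llm_coding` callable that wraps one API call and returns one run of labels; it is not a real library function.

```python
# Pre-study stability check: repeat the coding run n times and
# inspect the spread of kappa before committing to a full analysis.
from statistics import mean, stdev
from sklearn.metrics import cohen_kappa_score

def stability_check(human_labels, run_llm_coding, n_runs=10):
    """Mean and SD of Cohen's kappa across repeated, independent LLM runs."""
    kappas = [cohen_kappa_score(human_labels, run_llm_coding())
              for _ in range(n_runs)]
    return mean(kappas), stdev(kappas)
```

For calibration, the paper reports SD of roughly 0.017 for DeepSeek-Chat and Claude Haiku and 0.038 for Gemini 2.5 Flash; a spread well beyond those values would argue against relying on a single run.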
Where Pith is reading between the lines
- Model-specific prompt effects observed here likely appear in other qualitative coding domains such as code review sentiment or team dynamics analysis.
- Hybrid workflows that let LLMs draft initial codes and humans correct the known bias categories could cut analysis time while preserving validity.
- Future experiments could examine whether larger context windows or different example selection methods reduce the stability gap seen with Gemini.
- These results suggest that as LLMs evolve, periodic re-testing of prompting strategies will be required rather than one-time guideline adoption.
Load-bearing premise
That the labels produced by human coders represent a stable and objective ground truth suitable for direct comparison with LLM outputs across the chosen dataset and prompts.
What would settle it
A replication using an independently coded dataset of software engineering community texts with different category distributions, where the multi-shot improvement for Claude Haiku disappears or the bias patterns reverse.
Original abstract
Qualitative analysis plays a pivotal role in understanding the human and social aspects of software engineering. However, it remains a demanding process shaped by the subjective interpretation of individual researchers and sensitive to methodological choices such as prompt design. Recent advancements in Large Language Models (LLMs) offer promising opportunities to support this type of analysis, although their reliability in reproducing human qualitative reasoning under varying prompting conditions remains largely untested. This study presents a controlled empirical evaluation of three LLMs -- Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash -- across two prompt engineering strategies (zero-shot and multi-shot closed coding), using Cohen's kappa as the primary agreement metric over ten independent runs per configuration. Results suggest that multi-shot prompting significantly improves agreement for Claude Haiku (Delta kappa = +0.034, Wilcoxon p = 0.004) but not for DeepSeek-Chat or Gemini 2.5 Flash. Intra-model stability varies substantially -- DeepSeek-Chat and Claude Haiku exhibit the lowest variance (SD approx. 0.017), while Gemini 2.5 Flash is the least stable (SD = 0.038). A systematic over-prediction of "Sharing Negative Feedback" is identified across all models (bias ratios up to 5.25x), alongside consistent under-prediction of "Expressing Concerns." Collectively, these findings provide empirical evidence for prompt engineering guidelines in LLM-assisted qualitative coding for software engineering research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a controlled empirical study evaluating zero-shot versus multi-shot prompting strategies for three LLMs (Claude Haiku, DeepSeek-Chat, Gemini 2.5 Flash) in qualitative coding of psychological safety themes from software engineering community data. Cohen's kappa is used as the primary metric of agreement with human labels across ten independent runs per configuration, with additional reporting of intra-model stability (SD of kappa) and category-specific prediction biases (e.g., over-prediction of 'Sharing Negative Feedback').
Significance. If the results hold after addressing ground-truth issues, the work supplies model-specific empirical guidance on prompt engineering for LLM-assisted qualitative analysis in software engineering, an area where such support is increasingly adopted. The controlled design, use of ten runs per condition, and application of Wilcoxon tests provide a stronger empirical foundation than single-run anecdotal comparisons.
major comments (2)
- [Methods] Methods section: Human inter-rater reliability (pairwise or multi-rater Cohen's kappa, number of coders, and disagreement resolution procedure) is not reported. Because the central claims rest on LLM-human kappa values and their deltas (e.g., +0.034 for Claude Haiku), the absence of a human baseline prevents interpretation of whether observed agreements exceed, match, or fall below typical human variability in this subjective coding task. This is load-bearing for the prompting-effect and bias conclusions.
- [Results] Results section: The claim of systematic LLM bias in over-predicting 'Sharing Negative Feedback' (bias ratios up to 5.25x) and under-predicting 'Expressing Concerns' lacks supporting human confusion matrices or label-distribution statistics. Without these, the observed patterns cannot be distinguished from possible mirroring of human labeling tendencies.
minor comments (2)
- [Abstract] Abstract: 'Delta kappa' should be written as Δκ or 'κ difference' for notational consistency with statistical reporting standards.
- [Abstract] Abstract and Methods: Dataset size (number of posts or excerpts coded), exact prompt templates, and the precise definition of the ten-run sampling procedure should be stated explicitly (or clearly referenced to an appendix) to support replication.
Simulated Author's Rebuttal
Thank you for your thorough and constructive review. We appreciate the emphasis on methodological transparency and have revised the manuscript to address the concerns about human inter-rater reliability and supporting statistics for the bias analysis.
Point-by-point responses
Referee: [Methods] Methods section: Human inter-rater reliability (pairwise or multi-rater Cohen's kappa, number of coders, and disagreement resolution procedure) is not reported. Because the central claims rest on LLM-human kappa values and their deltas (e.g., +0.034 for Claude Haiku), the absence of a human baseline prevents interpretation of whether observed agreements exceed, match, or fall below typical human variability in this subjective coding task. This is load-bearing for the prompting-effect and bias conclusions.
Authors: We agree that a human inter-rater reliability baseline is necessary to properly interpret the LLM-human agreement values and their changes under different prompting strategies. In the revised manuscript we have expanded the Methods section to report the number of human coders, the pairwise Cohen's kappa between them, and the procedure used to resolve disagreements. This addition supplies the missing context for evaluating the observed deltas and bias patterns against typical human variability.
Revision: yes.
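A minimal sketch of such a human baseline, assuming a hypothetical `codings` dict mapping each coder to their label sequence over the same excerpts; the study's actual coder count and resolution procedure are reported in the revision, not here.

```python
# Mean pairwise Cohen's kappa across all human coder pairs,
# the baseline against which LLM-human agreement can be read.
from itertools import combinations
from statistics import mean
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(codings):
    """Average Cohen's kappa over all pairs of human coders."""
    pairs = combinations(codings.values(), 2)
    return mean(cohen_kappa_score(a, b) for a, b in pairs)
```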
Referee: [Results] Results section: The claim of systematic LLM bias in over-predicting 'Sharing Negative Feedback' (bias ratios up to 5.25x) and under-predicting 'Expressing Concerns' lacks supporting human confusion matrices or label-distribution statistics. Without these, the observed patterns cannot be distinguished from possible mirroring of human labeling tendencies.
Authors: We concur that human label-distribution statistics and confusion matrices are required to distinguish LLM-specific biases from simple reproduction of the human label distribution. We have revised the Results section to include the human label frequencies across categories together with confusion matrices that compare human and LLM predictions. These additions show that the over-prediction of 'Sharing Negative Feedback' exceeds what would be expected from mirroring the human distribution alone.
Revision: yes.
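An illustrative computation of the statistics this revision adds, assuming hypothetical parallel lists of human and LLM labels; the category names follow the paper's coding scheme, but the data shapes are assumptions.

```python
# Per-category bias ratios and a human-vs-LLM confusion matrix.
from collections import Counter
from sklearn.metrics import confusion_matrix

def bias_ratios(human_labels, llm_labels):
    """Ratio of LLM to human frequency per category; >1 means over-prediction."""
    human = Counter(human_labels)
    llm = Counter(llm_labels)
    return {cat: llm[cat] / human[cat] for cat in human}

def human_vs_llm_confusion(human_labels, llm_labels):
    """Rows are human codes, columns are LLM codes; off-diagonal mass shows drift."""
    cats = sorted(set(human_labels) | set(llm_labels))
    return cats, confusion_matrix(human_labels, llm_labels, labels=cats)
```

Under this reading, the reported 5.25x figure is simply bias_ratios(...)['Sharing Negative Feedback'].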
Circularity Check
No circularity in empirical comparison study
Full rationale
The paper is a controlled empirical evaluation that directly measures Cohen's kappa agreement between LLM outputs and human labels across zero-shot and multi-shot prompting for three models, with ten runs per configuration and basic statistical tests (Wilcoxon, SD). No equations, derivations, fitted parameters, or self-referential definitions appear; all reported deltas, variances, and bias ratios are computed from the observed data rather than constructed from author choices. The study contains no load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the central claims to inputs by definition, rendering the results self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human qualitative coding provides a reliable ground truth for evaluating LLM performance.
Reference graph
Works this paper leans on
- [1]
- [2] Muneera Bano, Rashina Hoda, Didar Zowghi, and Christoph Treude. 2024. Large language models for qualitative research in software engineering: exploring opportunities and challenges. Automated Software Engineering 31, 1 (2024), 8.
- [3] Cauã Ferreira Barros, Bruna Borges Azevedo, Valdemar Vicente Graciano Neto, Mohamad Kassab, Marcos Kalinowski, Hugo Alexandre D. Do Nascimento, and Michelle C.G.S.P. Bandeira. 2025. Large Language Model for Qualitative Research: A Systematic Mapping Study. In Proc. of the IEEE/ACM Int. Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE).
- [4] Banghao Chen, Zhaofeng Zhang, Nicolas Langréné, and Shengxin Zhu. 2025. Unleashing the potential of prompt engineering for large language models. Patterns 6, 6 (2025), 101260.
- [5] Amy Edmondson. 1999. Psychological safety and learning behavior in work teams. Administrative Science Quarterly 44, 2 (1999), 350–383.
- [6] Jie Gao, Yuchen Guo, Gionnieve Lim, Tianqin Zhang, Zheng Zhang, Toby Jia-Jun Li, and Simon Tangi Perrault. 2024. CollabCoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–29.
- [7] Matheus De Morais Leça, Lucas Valença, Reydne Santos, and Ronnie De Souza Santos. 2025. Applications and implications of large language models in qualitative analysis: A new frontier for empirical software engineering. In Proc. of the IEEE/ACM Int. Workshop on Methodological Issues with Empirical Studies in Software Engineering (WSESE). IEEE, 36–43.
- [8] Per Lenberg, Robert Feldt, Lucas Gren, Lars Göran Wallgren Tengberg, Inga Tidefors, and Daniel Graziotin. 2024. Qualitative software engineering research: Reflections and guidelines. Journal of Software: Evolution and Process 36, 6 (2024), e2607.
- [9] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. Comput. Surveys 55, 9, Article 195 (2023), 35 pages.
- [10] Beatriz Silva De Santana, Sávio Freire, Leandro Cruz, Lidivânio Monte, Manoel Mendonca, and José Amancio Macedo Santos. 2023. Exploring psychological safety in software engineering: Insights from Stack Exchange. In Proceedings of the XXXVII Brazilian Symposium on Software Engineering (SBES). 503–513.
- [11] Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, et al. 2024. The prompt report: A systematic survey of prompt engineering techniques. arXiv preprint arXiv:2406.06608 (2024).
- [12] Carolyn B. Seaman, Rashina Hoda, and Robert Feldt. 2025. Qualitative Research Methods in Software Engineering: Past, Present, and Future. IEEE Transactions on Software Engineering 51, 3 (2025), 783–788.
- [13] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
- [14] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. In Proceedings of the 30th Conference on Pattern Languages of Programs (PLoP). Article 5.