pith. machine review for the scientific record.

arxiv: 2604.16756 · v2 · submitted 2026-04-18 · 💻 cs.SE · cs.AI

Recognition: unknown

Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:34 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords prompt engineering · cognitive biases · software engineering · AI decision support · bias mitigation · axiomatic reasoning · general-purpose AI

The pith

An end-to-end prompting method that elicits SE best practices and injects axiomatic reasoning cues cuts AI bias sensitivity by 51 percent on average in software engineering tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how small changes in wording can alter general-purpose AI (GPAI) decisions on software engineering problems even when the underlying task logic remains unchanged. Standard prompt techniques such as chain-of-thought reasoning and self-debiasing show no reliable per-bias improvement. The authors instead treat reasoning as an explicit process of surfacing background axioms and best practices, then insert targeted cues to enforce that process before the model answers. This approach produces a statistically significant average reduction in bias sensitivity across eight SE-relevant biases. A reader would care because software engineering decisions frequently depend on natural-language requirements, where framing effects can steer outcomes toward suboptimal choices.
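The benchmark behind these measurements (PROBE-SWE, per the abstract below) pairs biased and unbiased versions of the same dilemma. For intuition, one item can be pictured as a matched pair; the `DilemmaPair` name, its fields, and the example wording below are hypothetical illustrations, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DilemmaPair:
    """Illustrative stand-in for one PROBE-SWE item: the same SE dilemma in
    two wordings that share logic and difficulty. Names and fields are
    hypothetical, not the benchmark's schema."""
    unbiased_prompt: str       # neutral wording of the dilemma
    biased_prompt: str         # identical except for one bias-inducing cue
    bias_type: str             # one of the eight SE-relevant biases
    best_practice_choice: str  # the decision SE best practices support

# Hypothetical confirmation-bias pair, in the spirit of Figure 2's
# prior-success cue.
pair = DilemmaPair(
    unbiased_prompt="Two modules fail the same test. Which do you inspect first?",
    biased_prompt=("Two modules fail the same test. Module A worked fine in the "
                   "last release. Which do you inspect first?"),
    bias_type="confirmation",
    best_practice_choice="inspect both with equal scrutiny",
)
```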

Core claim

The paper claims that bias-inducing features in prompts short-circuit the elicitation of implicit assumptions, and that an end-to-end method which first elicits relevant SE best practices and then injects axiomatic reasoning cues into the prompt before answering reduces overall bias sensitivity by 51 percent on average.

What carries the argument

The end-to-end method that elicits best practices and injects axiomatic reasoning cues to block bias short-circuiting.
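As a concrete reading of that machinery, here is a minimal sketch of the two-stage structure the abstract describes: elicit the implicit SE best practices first, then pin them into the prompt as explicit axioms before asking for a decision. The prompt wording, the `call_model` stub, and the model identifier are assumptions for illustration, not the authors' implementation.

```python
MODEL = "any-gpai-model"  # hypothetical identifier, not from the paper

def call_model(prompt: str, model: str = MODEL) -> str:
    """Placeholder for a GPAI API call; fill in the client of your choice."""
    raise NotImplementedError

def answer_with_axioms(dilemma: str) -> str:
    # Stage 1: elicit the best practices relevant to the dilemma,
    # without asking for a decision yet.
    elicit = (
        "List the software engineering best practices and background "
        "assumptions needed to decide the following dilemma. Do not "
        f"answer it yet.\n\n{dilemma}"
    )
    axioms = call_model(elicit)

    # Stage 2: inject the elicited practices as explicit axioms and
    # require the answer to follow from them (the axiomatic reasoning cue).
    decide = (
        "Treat the following best practices as axioms. Derive your decision "
        "only from them, ignoring any persuasive framing in the dilemma.\n\n"
        f"Axioms:\n{axioms}\n\nDilemma:\n{dilemma}\n\nDecision:"
    )
    return call_model(decide)
```

The design point carried by the paper's hypothesis: the first call runs before any bias-laden cue can short-circuit assumption elicitation, so the second call reasons from stated premises rather than from the framing.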

If this is right

  • Practitioners gain an off-the-shelf technique they can insert before any GPAI call on SE dilemmas.
  • The thematic analysis identifies linguistic patterns that mark cases where GPAI decision support is more prone to bias.
  • The same prompt structure may be reused across multiple SE tasks without retraining the underlying model.
  • Future countermeasures can be focused on the specific linguistic triggers the analysis surfaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method might transfer to other domains that rely on implicit domain axioms in natural-language prompts, such as legal or medical decision support.
  • Extending the PROBE-SWE benchmark to additional biases or larger models would test whether the 51 percent reduction holds under broader conditions.
  • If the explicit-axiom step proves robust, it offers a lightweight alternative to fine-tuning for bias control in deployed systems.

Load-bearing premise

That forcing explicit background axioms in a Prolog-style manner actually stops biased shortcuts rather than merely adding another layer of prompt variation that could introduce new biases or lower performance on unbiased cases.

What would settle it

The central claim would be falsified by applying the method to the PROBE-SWE benchmark pairs and observing no statistically significant drop in overall bias sensitivity, or by finding that the method degrades decision quality on the unbiased versions of the dilemmas.
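One hedged way to operationalize that test: per bias, compare paired flip outcomes between the baseline and the method with an exact McNemar-style sign test, then control the false discovery rate across the eight biases with Benjamini-Hochberg (a procedure the paper's reference list includes). The specific test choice here is an assumption, not the paper's reported analysis.

```python
from scipy.stats import binomtest

def mcnemar_exact(baseline_flips, method_flips):
    """Exact sign test on discordant pairs: dilemma pairs where exactly one
    of the two conditions flipped its decision under the biased wording."""
    b = sum(x and not y for x, y in zip(baseline_flips, method_flips))
    c = sum(y and not x for x, y in zip(baseline_flips, method_flips))
    if b + c == 0:
        return 1.0  # no discordant pairs, no evidence either way
    return binomtest(b, b + c, 0.5).pvalue

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure: returns a reject/keep flag per hypothesis."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected
```

Running `mcnemar_exact` once per bias and `benjamini_hochberg` over the eight resulting p-values yields the kind of per-bias verdict the referee report asks about; no rejections would count against the central claim.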

Figures

Figures reproduced from arXiv: 2604.16756 by Francesco Sovrano, Gabriele Dominici, and Alberto Bacchelli.

Figure 1. Diagrammatic PROBE-SWE example (confirmation bias) illustrating how bias sensitivity is detected.
Figure 2. Concrete PROBE-SWE example (confirmation bias). Matched unbiased and biased dilemma prompts differ only by a highlighted prior-success cue.
Figure 3. Bias sensitivity across prompting strategies (higher is worse). Strategies are grouped by research…
Figure 4. Strategy effectiveness by GPAI model (higher values indicate worse performance and higher bias…).
Figure 5. Strategy effectiveness by complexity tier (higher values indicate worse performance). Complexity tiers…
Figure 6. Effects of lexicon features on bias sensitivity (log rate ratio).
Figure 7. Open-ended SE dilemmas: mean bias sensitivity rate by model and prompting strategy. Lower is better.
read the original abstract

Prompt-induced cognitive biases are changes in a general-purpose AI (GPAI) system's decisions caused solely by biased wording in the input (e.g., framing, anchors), not task logic. In software engineering (SE) decision support (where problem statements and requirements are natural language) small phrasing shifts (e.g., popularity hints or outcome reveals) can push GPAI models toward suboptimal decisions. We study this with PROBE-SWE, a dynamic benchmark for SE that pairs biased and unbiased versions of the same SE dilemmas, controls for logic and difficulty, and targets eight SE-relevant biases (anchoring, availability, bandwagon, confirmation, framing, hindsight, hyperbolic discounting, overconfidence). We ask whether prompt engineering mitigates bias sensitivity in practice, focusing on actionable techniques that practitioners can apply off-the-shelf in real environments. Testing common strategies (e.g., chain-of-thought, self-debiasing) on cost-effective GPAI systems, we find no statistically significant reductions in bias sensitivity on a per-bias basis. We then adopt a Prolog-style view of the reasoning process: solving SE dilemmas requires making explicit any background axioms and inference assumptions (i.e., SE best practices) that are usually implicit in the prompt. So, we hypothesize that bias-inducing features short-circuit assumption elicitation, pushing GPAI models toward biased shortcuts. Building on this, we introduce an end-to-end method that elicits best practices and injects axiomatic reasoning cues into the prompt before answering, reducing overall bias sensitivity by 51% on average (p < .001). Finally, we report a thematic analysis that surfaces linguistic patterns associated with heightened bias sensitivity, clarifying when GPAI use is less advisable for SE decision support and where to focus future countermeasures.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PROBE-SWE, a benchmark pairing biased and unbiased versions of the same SE dilemmas while controlling for logic and difficulty, targeting eight SE-relevant cognitive biases. Standard prompt techniques (CoT, self-debiasing) show no significant per-bias reduction, but a new end-to-end method that first elicits SE best practices as explicit axioms (Prolog-style) and injects them before answering reports a 51% average drop in bias sensitivity (p < .001). A thematic analysis of linguistic patterns associated with bias sensitivity is also presented.

Significance. If the 51% reduction holds under proper controls, the work would supply a practical, off-the-shelf prompt intervention for reducing bias in GPAI-supported SE decisions and a reusable benchmark for studying prompt-induced biases in the domain. The explicit-axiom framing offers a concrete hypothesis about why biases arise in SE contexts.

major comments (3)
  1. [Results] Results section (the 51% claim): No ablation is reported against length-matched or structure-matched controls that add equivalent tokens or reasoning steps while omitting the axiomatic/Prolog-style content. Without this, it is impossible to isolate whether the reduction stems from forcing explicit background axioms or from generic prompt elaboration.
  2. [Benchmark description] Benchmark and experimental setup: The manuscript provides no detail on controls for model temperature, prompt order effects, or whether the axiomatic cues were generated by the same model or by humans. These omissions directly affect the reliability of the paired biased/unbiased comparisons and the reported p < .001.
  3. [Results] Results section: Performance deltas on the unbiased items alone are not reported. This prevents determining whether the method improves overall decision quality or simply shifts outputs toward more conservative/verbose behavior across both biased and unbiased cases.
minor comments (2)
  1. [Experimental setup] Specify the exact models, versions, and temperature settings used for all experiments to support reproducibility.
  2. [Methods] Define the precise formula for 'bias sensitivity' (the metric underlying the 51% figure) with an equation or pseudocode in the methods.
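The text available here never states that formula, so the sketch below is only one plausible reading of 'bias sensitivity', labeled as a hypothesis: the fraction of matched pairs whose decision changes when the biased cue is added. Under this reading, the reported 51% reduction would mean the method roughly halves the flip rate.

```python
def bias_sensitivity(decisions_unbiased, decisions_biased):
    """Hypothetical reading of 'bias sensitivity': the fraction of matched
    PROBE-SWE pairs whose decision diverges once the biased cue is added.
    The paper's exact formula may differ; this is an illustration only."""
    assert decisions_unbiased and len(decisions_unbiased) == len(decisions_biased)
    flips = sum(u != b for u, b in zip(decisions_unbiased, decisions_biased))
    return flips / len(decisions_unbiased)
```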

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the manuscript. We address each major comment point by point below, with clarifications and commitments to revisions that improve the work without altering its core claims.

read point-by-point responses
  1. Referee: [Results] Results section (the 51% claim): No ablation is reported against length-matched or structure-matched controls that add equivalent tokens or reasoning steps while omitting the axiomatic/Prolog-style content. Without this, it is impossible to isolate whether the reduction stems from forcing explicit background axioms or from generic prompt elaboration.

    Authors: We agree that length- and structure-matched ablations are required to isolate the contribution of the axiomatic content. In the revised manuscript we will add these controls: prompts of matched token length and reasoning-step count that perform generic elaboration or additional inference steps but omit the SE best-practice axioms (a construction along these lines is sketched after this exchange). This will allow direct comparison to confirm that the 51% bias-sensitivity reduction is attributable to the explicit axioms rather than prompt elaboration alone. revision: yes

  2. Referee: [Benchmark description] Benchmark and experimental setup: The manuscript provides no detail on controls for model temperature, prompt order effects, or whether the axiomatic cues were generated by the same model or by humans. These omissions directly affect the reliability of the paired biased/unbiased comparisons and the reported p < .001.

    Authors: We acknowledge the omission of these details. The revised Benchmark and Experimental Setup section will specify that (i) temperature was fixed at 0.0 to minimize sampling variance, (ii) prompt presentation order was randomized across all trials to mitigate order effects, and (iii) axiomatic cues were elicited from the same model in a preliminary stage before being injected into the final prompt. These additions will support reproducibility and the validity of the statistical results. revision: yes

  3. Referee: [Results] Results section: Performance deltas on the unbiased items alone are not reported. This prevents determining whether the method improves overall decision quality or simply shifts outputs toward more conservative/verbose behavior across both biased and unbiased cases.

    Authors: We agree that reporting performance on the unbiased items is essential. In the revised Results section we will include decision-quality metrics (accuracy relative to the unbiased ground truth) on the unbiased dilemma versions, for both the baseline and the axiomatic-prompting conditions. This will clarify whether the intervention improves overall SE decision quality or produces uniform shifts in output style. revision: yes
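To make the commitment in response 1 concrete, here is one way a length-matched control prompt could be assembled. A sketch only: it assumes whitespace token counts rather than the authors' tokenizer, and the filler text is a placeholder for generic elaboration.

```python
def length_matched_control(
    dilemma: str,
    axiom_prompt: str,
    filler: str = "Consider the problem carefully from several angles.",
) -> str:
    """Build a control prompt that matches the axiomatic prompt's length in
    whitespace tokens while omitting the SE best-practice content.
    Whitespace tokenization is a simplifying assumption; a real ablation
    would match counts under the model's own tokenizer."""
    # Never trim below the dilemma itself, whatever the axiom prompt's size.
    target = max(len(axiom_prompt.split()), len(dilemma.split()))
    control = dilemma
    while len(control.split()) < target:
        control += " " + filler  # generic elaboration, no axiomatic content
    # Trim any overshoot so the token counts match exactly.
    return " ".join(control.split()[:target])
```

Comparing bias sensitivity under such a control against the axiomatic prompt would separate the effect of the explicit axioms from that of sheer prompt elaboration.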

Circularity Check

0 steps flagged

No circularity: empirical reduction measured on independent benchmark

full rationale

The paper defines PROBE-SWE as a paired biased/unbiased benchmark, adopts a Prolog-style hypothesis about axiom elicitation, introduces a prompt method, and reports an empirical 51% average bias-sensitivity drop (p < .001) from testing. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes smuggled in via prior work appear in the provided text. The 51% figure is a direct measurement on the benchmark rather than a quantity forced by construction from the hypothesis or inputs. The chain of reasoning is grounded in benchmark measurements and does not reduce to renaming or self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that SE decision problems can be usefully decomposed into explicit background axioms that are normally left implicit in prompts. No free parameters are introduced. No new physical or computational entities are postulated.

axioms (1)
  • domain assumption Solving SE dilemmas requires making explicit any background axioms and inference assumptions (i.e., SE best practices) that are usually implicit in the prompt.
    Stated in the abstract as the Prolog-style view that motivates the new method.

pith-pipeline@v0.9.0 · 5622 in / 1479 out tokens · 33469 ms · 2026-05-10T07:34:56.553396+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 32 canonical work pages · 3 internal anchors

  1. [1] Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management 39, 1 (2003), 45–65.
  2. [2] Muhammad Azeem Akbar, Arif Ali Khan, and Peng Liang. 2025. Ethical aspects of ChatGPT in software engineering research. IEEE Transactions on Artificial Intelligence 6, 2 (Feb. 2025), 254–267. doi:10.1109/TAI.2023.3318183
  3. [3] Maryam Amirizaniani, Elias Martin, Maryna Sivachenko, Afra Mashhadi, and Chirag Shah. 2024. Can LLMs Reason Like Humans? Assessing Theory of Mind Reasoning in LLMs for Open-Ended Questions. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM '24). Association for Computing Machinery, New York, NY, USA, 34–4…
  4. [4] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2022. Machine Bias. In Ethics of Data and Analytics: Concepts and Cases, Kirsten Martin (Ed.). Auerbach Publications (CRC Press), Boca Raton, FL, USA, 254–264. doi:10.1201/9781003278290-37
  5. [5] Anthropic. 2026. Claude Opus: Availability and pricing. https://platform.claude.com/docs/en/about-claude/pricing. Accessed: 2026-02-16.
  6. [6] Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 1 (1995), 289–300.
  7. [7] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
  8. [8] Gul Calikli and Ayse Basar Bener. 2010. Empirical Analyses of the Factors Affecting Confirmation Bias and the Effects of Confirmation Bias on Software Developer/Tester Performance. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering (PROMISE '10) (Timisoara, Romania). Association for Computing Machinery, New York…
  9. [9] A. Colin Cameron and Pravin K. Trivedi. 2013. Regression Analysis of Count Data (2 ed.). Cambridge University Press, Cambridge, UK. doi:10.1017/CBO9781139013567
  10. [10] Souti Chattopadhyay, Nicholas Nelson, Audrey Au, Natalia Morales, Christopher A. Sanchez, Rahul Pandita, and Anita Sarma. 2020. A Tale from the Trenches: Cognitive Biases and Software Development. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20) (Seoul, Republic of Korea). Association for Computing Machinery…
  11. [11] Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. 2023. A comprehensive empirical study of bias mitigation methods for machine learning classifiers. ACM Transactions on Software Engineering and Methodology 32, 4 (2023), 1–30.
  12. [12] Andrei-Victor Chisca, Andrei-Cristian Rad, and Camelia Lemnaru. 2024. Prompting Fairness: Learning Prompts for Debiasing Large Language Models. In Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion (LTEDI 2024). Association for Computational Linguistics, St. Julian's, Malta, 52–62. doi:10.18653/v1/2024.ltedi-1.6
  13. [13] Colton Clemmer, Junhua Ding, and Yunhe Feng. 2024. PreciseDebias: An Automatic Prompt Engineering Approach for Generative AI To Mitigate Image Demographic Biases. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (Waikoloa, HI, USA). IEEE, Piscataway, NJ, USA, 8596–8605.
  14. [14] Juliet Corbin and Anselm Strauss. 2014. Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory (4 ed.). SAGE Publications, Thousand Oaks, CA, USA.
  15. [15] Juliet M. Corbin and Anselm Strauss. 1990. Grounded Theory Research: Procedures, Canons, and Evaluative Criteria. Qualitative Sociology 13, 1 (1990), 3–21. doi:10.1007/BF00988593
  16. [16] Shih-Chieh Dai, Aiping Xiong, and Lun-Wei Ku. 2023. LLM-in-the-loop: Leveraging Large Language Model for Thematic Analysis. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, 9993–10001. doi:10.18653/v1/2023.findings-emnlp.669
  17. [17] Zackary Okun Dunivin. 2025. Scaling hermeneutics: a guide to qualitative coding with LLMs for reflexive content analysis. EPJ Data Science 14, 1 (2025), 28. doi:10.1140/EPJDS/S13688-025-00548-8
  18. [18] Fred Ederer and Nathan Mantel. 1974. Confidence limits on the ratio of two Poisson variables. American Journal of Epidemiology 100, 3 (1974), 165–167.
  19. [19] Alvan R. Feinstein and Domenic V. Cicchetti. 1990. High Agreement but Low Kappa: I. The Problems of Two Paradoxes. Journal of Clinical Epidemiology 43, 6 (1990), 543–549. doi:10.1016/0895-4356(90)90158-L
  20. [20] Samuel Ferino, Rashina Hoda, John Grundy, and Christoph Treude. 2025. Junior Software Developers' Perspectives on Adopting LLMs for Software Engineering: A Systematic Literature Review. arXiv. arXiv:2503.07556 https://arxiv.org/abs/2503.07556
  21. [21] Marvin Fleischmann, Miglena Amirpur, Alexander Benlian, and Thomas Hess. 2014. Cognitive Biases in Information Systems Research: A Scientometric Analysis. In Proceedings of the European Conference on Information Systems (ECIS 2014). AIS Electronic Library (AISeL), Tel Aviv, Israel, 1–21. https://aisel.aisnet.org/ecis2014/proceedings/track02/5/
  22. [22] Raymond Fok and Daniel S Weld. 2024. In search of verifiability: Explanations rarely enable complementary performance in AI-advised decision making. AI Magazine 45, 3 (2024), 317–332.
  23. [23] Shaz Furniturewala, Surgan Jandial, Abhinav Java, Pragyan Banerjee, Simra Shahid, Sumit Bhatia, and Kokil Jaidka. 2024. Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models. arXiv. arXiv:2405.10431 https://arxiv.org/abs/2405.10431
  24. [24] Ben Green and Yiling Chen. 2019. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–24.
  25. [25] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv. arXiv:2501.12948 https://arxiv.org/abs/2501.12948
  26. [26] Yue Guo, Yi Yang, and Ahmed Abbasi. 2022. Auto-Debias: Debiasing Masked Language Models with Automated Biased Prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, Dublin, Ireland…
  27. [27] Kilem L. Gwet. 2008. Computing Inter-Rater Reliability and Its Variance in the Presence of High Agreement. British Journal of Mathematical and Statistical Psychology 61, 1 (2008), 29–48. doi:10.1348/000711006X126600
  28. [28] Zhiyong Han, Aaron Tavasi, JuYoung Lee, Joshua Luzuriaga, Kevin Suresh, Michael Oppenheim, Fortunato Battaglia, Stanley R. Terlecky, et al. 2025. Can Large Language Models Be Used to Code Text for Thematic Analysis? An Explorative Study. Discover Artificial Intelligence 5 (2025), 171. doi:10.1007/s44163-025-00441-3
  29. [29] Martie G. Haselton, Daniel Nettle, and Paul W. Andrews. 2015. The Evolution of Cognitive Bias. In The Handbook of Evolutionary Psychology, David M. Buss (Ed.). Wiley, Hoboken, NJ, USA, 724–746. doi:10.1002/9781119125563.evpsych241
  30. [30] Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. 16 pages. arXiv:2111.09543 [cs.CL] doi:10.48550/arXiv.2111.09543 (arXiv:2111.09543v4, revised 2023-03-24)
  31. [31] Ken Hyland. 2005. Stance and engagement: A model of interaction in academic discourse. Discourse Studies 7, 2 (2005), 173–192.
  32. [32] Jan Kottner, Laurent Audigé, Stig Brorson, Allan Donner, Byron J. Gajewski, Asbjorn Hrobjartsson, Chris Roberts, Mohamed Shoukri, and David L. Streiner. 2011. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) Were Proposed. Journal of Clinical Epidemiology 64, 1 (2011), 96–106. doi:10.1016/j.jclinepi.2010.03.002
  33. [33] Matthew Lombard, Jennifer Snyder-Duch, and Cheryl Campanella Bracken. 2002. Content Analysis in Mass Communication: Assessment and Reporting of Intercoder Reliability. Human Communication Research 28, 4 (2002), 587–604. doi:10.1111/j.1468-2958.2002.tb00826.x
  34. [34] Peter McCullagh and John A. Nelder. 1989. Generalized Linear Models (2 ed.). Chapman & Hall/CRC, London.
  35. [35] Mary L. McHugh. 2012. Interrater Reliability: The Kappa Statistic. Biochemia Medica 22, 3 (2012), 276–282. doi:10.11613/BM.2012.031
  36. [36] Rahul Mohanani, Iflaah Salman, Burak Turhan, Pilar Rodríguez, and Paul Ralph. 2018. Cognitive biases in software engineering: A systematic mapping study. IEEE Transactions on Software Engineering 46, 12 (2018), 1318–1339.
  37. [37] Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov. 2024. Arithmetic without Algorithms: Language Models Solve Math with a Bag of Heuristics. arXiv. arXiv:2410.21272 https://arxiv.org/abs/2410.21272
  38. [38] OpenAI. 2026. OpenAI API Pricing. https://developers.openai.com/api/docs/pricing. Accessed: 2026-02-16.
  39. [39] Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2025. An empirical study of the non-determinism of ChatGPT in code generation. ACM Transactions on Software Engineering and Methodology 34, 2 (2025), 1–28.
  40. [40] J. Przyborowski and H. Wilenski. 1940. Homogeneity of results in testing samples from Poisson series: With an application to testing clover seed for dodder. Biometrika 31, 3-4 (1940), 313–323. doi:10.1093/biomet/31.3-4.313
  41. [41] Hongye Qiu, Yue Xu, Meikang Qiu, and Wenjie Wang. 2025. DR. GAP: Mitigating Bias in Large Language Models using Gender-Aware Prompting with Demonstration and Reasoning. arXiv. arXiv:2502.11603 https://arxiv.org/abs/2502.11603
  42. [42] Asha Rajbhoj, Akanksha Somase, Piyush Kulkarni, and Vinay Kulkarni. 2024. Accelerating Software Development Using Generative AI: ChatGPT Case Study. In Proceedings of the 17th Innovations in Software Engineering Conference (ISEC 2024) (Bangalore, India). Association for Computing Machinery, New York, NY, USA, 1–11.
  43. [43] Amirali Sajadi, Kostadin Damevski, and Preetha Chatterjee. 2025. Psycholinguistic Analyses in Software Engineering Text: A Systematic Literature Review. arXiv. arXiv:2503.05992 https://arxiv.org/abs/2503.05992
  44. [44] Johnny Saldaña. 2011. Fundamentals of Qualitative Research. Oxford University Press, New York, NY.
  45. [45] Johnny Saldaña. 2025. The Coding Manual for Qualitative Researchers. SAGE Publications Ltd, London, UK.
  46. [46] Aleix Sant, Carlos Escolano, Audrey Mash, Francesca De Luca Fornaciari, and Maite Melero. 2024. The Power of Prompts: Evaluating and Mitigating Gender Bias in MT with LLMs. arXiv. arXiv:2407.18786 https://arxiv.org/abs/2407.18786
  47. [47] Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. Transactions of the Association for Computational Linguistics 9 (2021), 1408–1424.
  48. [48] Francesco Sovrano. 2025. Replication Package of "Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering". https://github.com/Francesco-Sovrano/GPAI-sensitivity-to-cognitive-bias-in-software-engineering
  49. [49] Francesco Sovrano, Gabriele Dominici, Rita Sevastjanova, Alessandra Stramiglio, and Alberto Bacchelli. 2025. Is General-Purpose AI Reasoning Sensitive to Data-Induced Cognitive Biases? Dynamic Benchmarking on Typical Software Engineering Dilemmas. arXiv. arXiv:2508.11278 https://arxiv.org/abs/2508.11278
  50. [50] Francesco Sovrano and Tim Miller. 2025. MAGIX: A Unified Framework for the Use of XAI in Enterprises. Springer Nature Switzerland, Cham, 183–209. doi:10.1007/978-3-032-01940-0_6
  51. [51] Benyamin Tabarsi, Heidi Reichert, Ally Limke, Sandeep Kuttal, and Tiffany Barnes. 2025. LLMs' Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters. arXiv. arXiv:2503.05012 https://arxiv.org/abs/2503.05012
  52. [52] Yla R Tausczik and James W Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29, 1 (2010), 24–54.
  53. [53] Amos Tversky and Daniel Kahneman. 1974. Judgment under Uncertainty: Heuristics and Biases. Science 185, 4157 (1974), 1124–1131. doi:10.1126/science.185.4157.1124
  54. [54] Helena Vasconcelos, Matthew Jörke, Madeleine Grunde-McLaughlin, Tobias Gerstenberg, Michael S Bernstein, and Ranjay Krishna. 2023. Explanations can reduce overreliance on AI systems during decision-making. Proceedings of the ACM on Human-Computer Interaction 7, CSCW1 (2023), 1–38.
  55. [55] Jonathan Wang and Donald A Redelmeier. 2024. Cognitive biases and artificial intelligence. NEJM AI 1, 12 (2024), AIcs2400639.
  56. [56] Thomas Weber, Maximilian Brandmaier, Albrecht Schmidt, and Sven Mayer. 2024. Significant productivity gains through programming with large language models. Proceedings of the ACM on Human-Computer Interaction 8, EICS (2024), 1–29.
  57. [57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
  58. [58] Chuanchi Wen, Paul Clough, Rachel Paton, and Rebecca Middleton. 2026. Leveraging Large Language Models for Thematic Analysis: A Case Study in the Charity Sector. AI & Society 41 (2026), 731–748. doi:10.1007/s00146-025-02487-4
  59. [59] Halbert White. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 4 (1980), 817–838.
  60. [60] Tao Xiao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto. 2024. DevGPT: Studying Developer-ChatGPT Conversations. In 21st IEEE/ACM International Conference on Mining Software Repositories (MSR 2024), Lisbon, Portugal, April 15-16, 2024, Diomidis Spinellis, Alberto Bacchelli, and Eleni Constantinou (Eds.). ACM, Lisbon, Portugal, 227–230. doi:10.1145/3…
  61. [61] Ziyang Xu, Keqin Peng, Liang Ding, Dacheng Tao, and Xiliang Lu. 2024. Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction. arXiv. arXiv:2403.09963 https://arxiv.org/abs/2403.09963
  62. [62] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] doi:10.48550/arXiv.2505.09388