arxiv: 2604.16275 · v1 · submitted 2026-04-17 · 💻 cs.CL

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Pith reviewed 2026-05-10 08:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords politeness · LLMs · cross-linguistic · PLUM corpus · prompt tone · response quality · multilingual evaluation

The pith

Polite prompts improve LLM response quality by up to roughly 11 percent, but the gains are not universal across languages or models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how polite and impolite user prompts change the quality of answers from five large language models in English, Hindi, and Spanish. It generates 22,500 prompt-response pairs using three types of dialogue history and scores every response on eight quality factors. The results show that courteous language raises average scores while rude language lowers them, yet the size and direction of the change depend on the specific language and the model being used. A reader should care because this means no single politeness strategy works for every user or every AI system.
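The 22,500 figure is what a full factorial design would produce: 1,500 prompts, each sent to five models under three dialogue histories. The sketch below enumerates that grid. It assumes this multiplication is how the sample was built, and it assumes the 1,500 prompts split evenly into 100 per language-by-politeness cell; neither is stated in the abstract. `PROMPTS_PER_CELL` and `build_grid` are illustrative names, not the authors'.

```python
from itertools import product

LANGUAGES = ["English", "Hindi", "Spanish"]
MODELS = ["Gemini-Pro", "GPT-4o Mini", "Claude 3.7 Sonnet", "DeepSeek-Chat", "Llama 3"]
HISTORIES = ["raw", "polite", "impolite"]
N_POLITENESS_LEVELS = 5   # the paper names five levels; their labels are not listed here
PROMPTS_PER_CELL = 100    # ASSUMPTION: 1,500 prompts spread evenly over 3 x 5 cells


def build_grid():
    """Return the prompt list and every (prompt, model, history) combination."""
    prompts = list(product(LANGUAGES, range(N_POLITENESS_LEVELS), range(PROMPTS_PER_CELL)))
    grid = list(product(prompts, MODELS, HISTORIES))
    return prompts, grid


prompts, grid = build_grid()
print(len(prompts), len(grid))
```

Under these assumptions the counts come out to 1,500 prompts and 22,500 prompt-response slots, matching the reported sample size.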

Core claim

The central claim is that politeness functions as a quantifiable computational variable that affects LLM behaviour, though its impact is language- and model-dependent rather than universal. Polite prompts enhance average response quality by up to approximately 11 percent and impolite tones worsen it. English performs best with courteous or direct tones, Hindi with deferential and indirect tones, and Spanish with assertive tones. Among the models, Llama shows the largest sensitivity to tone (11.5 percent range) while GPT remains more robust to adversarial tone.
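The summary does not say how the "11.5 percent range" is computed. One plausible reading is the spread between a model's best and worst mean score across tone levels, expressed relative to the worst. The sketch below implements that reading only; the function name, the level labels, and the numbers are placeholders, not values from the paper.

```python
def tone_range_percent(mean_scores):
    """Spread between the best and worst tone level, as a percentage of the worst.

    ASSUMPTION: this is one possible definition of "tone sensitivity range";
    the paper may normalise differently (for example against a neutral baseline).
    """
    lowest = min(mean_scores.values())
    highest = max(mean_scores.values())
    if lowest <= 0:
        raise ValueError("mean scores must be positive")
    return (highest - lowest) / lowest * 100.0


# Placeholder means per politeness level, NOT data from the paper.
example_means = {"level_1": 4.0, "level_2": 4.2, "level_3": 4.1, "level_4": 3.9, "level_5": 3.8}
print(round(tone_range_percent(example_means), 2))
```

Which denominator is used matters: dividing by the mean or by a neutral-tone baseline would give a different percentage from the same scores.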

What carries the argument

The eight-factor assessment framework of coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability applied to responses generated from the PLUM corpus of 1,500 human-validated prompts across three languages and five politeness levels.
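How the eight factors combine into a single number is not described here (Figure 1 refers to a "CQS" without defining it). The sketch below shows one simple way to do it: an unweighted mean with toxicity reverse-scored so that higher is always better. The 1 to 5 scale, the equal weights, and the reverse-scoring are all assumptions made for illustration.

```python
FACTORS = [
    "coherence", "clarity", "depth", "responsiveness",
    "context_retention", "toxicity", "conciseness", "readability",
]
LOWER_IS_BETTER = {"toxicity"}     # ASSUMPTION: toxicity is reverse-scored
SCALE_MIN, SCALE_MAX = 1.0, 5.0    # ASSUMPTION: each factor is rated on 1..5


def composite_score(scores):
    """Unweighted mean of the eight factors, with toxicity flipped."""
    missing = [name for name in FACTORS if name not in scores]
    if missing:
        raise KeyError(f"missing factors: {missing}")
    total = 0.0
    for name in FACTORS:
        value = scores[name]
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"{name} is outside the assumed scale")
        if name in LOWER_IS_BETTER:
            value = SCALE_MAX + SCALE_MIN - value  # 1 becomes 5, 5 becomes 1
        total += value
    return total / len(FACTORS)


best_case = {name: 5.0 for name in FACTORS}
best_case["toxicity"] = 1.0
print(composite_score(best_case))
```

Equal weighting is the simplest choice but not a neutral one: it lets conciseness count as much as depth, so any reported uplift depends on the weighting the authors actually used.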

If this is right

  • Tone and prior dialogue history measurably shift model output quality in the tested settings.
  • Optimal prompt tone differs by language, with English favoring courtesy, Hindi favoring deference, and Spanish favoring assertiveness.
  • Model choice matters: Llama varies most with tone while GPT varies least.
  • The released PLUM corpus supplies a reusable test set for studying politeness in multilingual LLM interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt design for production systems may need separate tuning for each target language rather than a single polite template.
  • Training data that over-represents one cultural politeness norm could explain why some models respond more strongly than others.
  • The same experimental setup could be extended to additional languages or newer model versions to test whether the observed language-specific patterns persist.

Load-bearing premise

That the eight-factor assessment framework provides a reliable, unbiased measure of response quality that generalizes across languages and models.

What would settle it

A new set of evaluations that finds no consistent difference in any of the eight quality scores between polite and impolite versions of the same prompts across all three languages and five models would falsify the central claim.
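Because each prompt exists in a polite and an impolite version, the comparison is paired. A paired sign-flip permutation test is one way to check a single language-and-model cell for "no consistent difference". This is a sketch of a technique chosen for illustration, not the test the authors used; the function name and inputs are hypothetical.

```python
import random


def paired_permutation_pvalue(polite, impolite, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences."""
    if len(polite) != len(impolite) or not polite:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - q for p, q in zip(polite, impolite)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)
```

Falsifying the central claim would require a null result like this in every cell and on every factor, after correcting for the number of tests run.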

Figures

Figures reproduced from arXiv: 2604.16275 by Arjit Saxena, Garima Chhikara, Hitesh Mehta, Rohit Kumar.

Figure 1: CQS Performance across Politeness Categories and ...
Original abstract

This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensitive (11.5% range), whereas GPT is more robust to adversarial tone. These results indicate that politeness is a quantifiable computational variable that affects LLM behaviour, though its impact is language- and model-dependent rather than universal. To support reproducibility and future work, we additionally release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts across three languages and five politeness categories, and provide a formal supplementary analysis of six falsifiable hypotheses derived from politeness theory, empirically assessed against the dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates the effects of varying politeness levels in prompts on LLM response quality across English, Hindi, and Spanish using five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, Llama 3). It introduces the PLUM corpus of 1,500 human-validated prompts, generates 22,500 prompt-response pairs under raw/polite/impolite histories, and scores them on an eight-factor framework (coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, readability). Key claims are that polite prompts improve average quality by up to ~11% while impolite tones degrade it, but effects are language- and model-dependent (e.g., English favors courteous/direct, Hindi deferential/indirect, Spanish assertive; Llama shows 11.5% range while GPT is more robust), supported by analysis of six falsifiable hypotheses from politeness theory, with the corpus released for reproducibility.

Significance. If the evaluation pipeline is shown to be reliable and calibrated, the work provides concrete evidence that politeness is a measurable but non-universal variable in LLM behavior, with direct implications for multilingual prompt engineering and deployment. The public release of the PLUM corpus and the empirical assessment of politeness-theory hypotheses against a large cross-lingual dataset are clear strengths for reproducibility and theory-driven research.

major comments (3)
  1. [Evaluation Framework] The eight-factor assessment framework (coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, readability) is the sole basis for all quantitative results including the ~11% polite uplift, 11.5% Llama range, and language-specific tone preferences, yet the manuscript provides no details on whether scores come from human raters, LLM-as-judge prompts, or rule-based metrics, nor any inter-rater reliability statistics, agreement measures, or cross-lingual calibration against native speakers for Hindi and Spanish.
  2. [Methods] No information is given on prompt generation, human validation procedure for the 1,500 PLUM items, sampling of the 22,500 pairs, or statistical controls for multiple comparisons across languages, models, politeness levels, and histories; without these, the central claim that effects are 'neither consistent nor universal' cannot be evaluated for robustness.
  3. [Supplementary Analysis] The supplementary analysis of six falsifiable hypotheses from Brown & Levinson and Culpeper is presented as supporting evidence, but the manuscript does not report how the eight-factor scores were aggregated or tested against each hypothesis, leaving open whether the language- and model-specific patterns actually confirm or contradict the theoretical predictions.
minor comments (2)
  1. [Abstract] The abstract states 'three interaction histories between users (raw, polite, and impolite)' but does not clarify whether these modify the current prompt, prior turns, or both; this distinction should be made explicit with an example in the main text.
  2. [Corpus Release] While the PLUM corpus release is a positive step, the paper should include a dedicated paragraph specifying license, exact access URL or repository, file formats, and any usage restrictions.
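The first major comment asks for inter-rater reliability statistics. For two raters assigning categorical scores, Cohen's kappa is a common starting point. The sketch below computes it from scratch. It is offered as an illustration of what such a statistic involves, not as the measure the authors will report; for ordinal 1 to 5 ratings a weighted kappa or Krippendorff's alpha may be a better fit.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


print(cohens_kappa([1, 2, 1, 2], [1, 2, 1, 2]))
```

Reporting kappa separately for English, Hindi, and Spanish would speak directly to the cross-lingual calibration concern.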

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. Their comments have helped us identify areas for improvement in methodological transparency. We provide point-by-point responses to the major comments below, outlining the revisions we plan to make.

  1. Referee: [Evaluation Framework] The eight-factor assessment framework (coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, readability) is the sole basis for all quantitative results including the ~11% polite uplift, 11.5% Llama range, and language-specific tone preferences, yet the manuscript provides no details on whether scores come from human raters, LLM-as-judge prompts, or rule-based metrics, nor any inter-rater reliability statistics, agreement measures, or cross-lingual calibration against native speakers for Hindi and Spanish.

    Authors: We thank the referee for this important comment. The manuscript indeed omits explicit details on the scoring methodology. We will revise the paper to include a clear description of how the eight-factor scores were generated: a combination of rule-based metrics for objective factors and LLM-as-judge for subjective ones, with human ratings on a subset for validation. We will also report inter-rater reliability statistics, agreement measures, and the cross-lingual calibration process involving native speakers for Hindi and Spanish. This addition will ensure transparency and allow proper evaluation of the quantitative results.
    revision: yes

  2. Referee: [Methods] No information is given on prompt generation, human validation procedure for the 1,500 PLUM items, sampling of the 22,500 pairs, or statistical controls for multiple comparisons across languages, models, politeness levels, and histories; without these, the central claim that effects are 'neither consistent nor universal' cannot be evaluated for robustness.

    Authors: We agree that the Methods section requires expansion for reproducibility. We will add detailed information on: the prompt generation process using theory-informed templates adapted per language; the human validation procedure for the 1,500 PLUM items, including the number of validators and criteria used; the sampling strategy for the 22,500 pairs; and the statistical methods employed, including controls for multiple comparisons (e.g., correction procedures). These revisions will support the robustness of our claim that effects are neither consistent nor universal.
    revision: yes

  3. Referee: [Supplementary Analysis] The supplementary analysis of six falsifiable hypotheses from Brown & Levinson and Culpeper is presented as supporting evidence, but the manuscript does not report how the eight-factor scores were aggregated or tested against each hypothesis, leaving open whether the language- and model-specific patterns actually confirm or contradict the theoretical predictions.

    Authors: We acknowledge the need for more explicit reporting on the supplementary analysis. We will update the manuscript to describe how the eight-factor scores were aggregated (e.g., into an overall quality metric) and the specific statistical tests used to evaluate each of the six hypotheses against the data. This will clarify the connection between the empirical findings and the theoretical predictions from Brown & Levinson and Culpeper.
    revision: yes
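The second response promises "correction procedures" for multiple comparisons without naming one. The Holm step-down procedure is a standard option that controls the family-wise error rate. The sketch below is a minimal version, shown only to make concrete what such a control does across many language, model, and tone comparisons. The choice of Holm is an assumption, not the authors' stated method.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return a reject flag for each hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

In this example only the two smallest p-values survive, even though all four are below 0.05 on their own.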

Circularity Check

0 steps flagged

Empirical measurements with no self-referential derivations or reductions

The paper reports experimental results from querying LLMs with politeness-varied prompts across languages and models, then scoring outputs on a fixed eight-factor framework. No equations, predictions, fitted parameters, or uniqueness claims appear that reduce any reported effect (e.g., the ~11% polite uplift) to the inputs by construction. The work is self-contained: differences are computed directly from the described corpus and evaluation procedure, with no load-bearing self-citations or ansatzes that collapse the central findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of Brown and Levinson politeness theory and Culpeper impoliteness framework as applied to LLMs, plus the assumption that the eight scoring factors capture response quality. No free parameters are explicitly fitted in the abstract; the politeness categories and assessment factors are treated as given inputs.

axioms (1)
  • domain assumption: Brown and Levinson's Politeness Theory and Culpeper's Impoliteness Framework accurately categorize prompt politeness levels in English, Hindi, and Spanish.
    Invoked in the abstract as the basis for the five politeness levels and experimental design.
invented entities (1)
  • PLUM corpus (independent evidence)
    purpose: Public dataset of 1,500 human-validated prompts across three languages and five politeness categories.
    New resource introduced to support reproducibility; independent evidence is the human validation step described.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

  1. [1]

    LLMs in Education,

    L. Ng, “LLMs in Education,” XRDS: Crossroads, The ACM Magazine for Students, 2024

  2. [2]

    LLMs in Education: Novel Perspectives, Challenges, and Opportunities,

    B. Alhafni, S. Vajjala, S. Bannò, K. K. Maurya, and E. Kochmar, “LLMs in Education: Novel Perspectives, Challenges, and Opportunities,” arXiv:2409.11917v1 [cs.CL], 18 Sep 2024

  3. [3]

    AssistantGPT: Enhancing User Interaction with LLM Integration,

    K. B. Neszlényi, A. Milos, and A. Kiss, “AssistantGPT: Enhancing User Interaction with LLM Integration,” IEEE 22nd International Symposium on Intelligent Systems and Informatics, 2024

  4. [4]

    Using ChatGPT and other LLMs in Professional Environments,

    S. Alaswad, T. Kalganova, and W. S. Awad, “Using ChatGPT and other LLMs in Professional Environments,” Information Sciences Letters, vol. 12, no. 9, 2023. Available at: https://digitalcommons.aaru.edu.jo/isl/vol12/iss9/17

  5. [5]

    Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions,

    Z. Abbasiantaeb, Y. Yuan, E. Kanoulas, and M. Aliannejadi, “Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions,” in Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM ’24). New York, NY, USA: Association for Computing Machinery, March 2024, 10 pages. Available at: htt...

  6. [6]

    GPT-4 Technical Report,

    OpenAI, “GPT-4 Technical Report,” OpenAI, Tech. Rep., 2023. Available at: https://cdn.openai.com/papers/gpt-4.pdf

  7. [7]

    Google’s AI Search Answers Now Reach Over 1.5 Billion Users Each Month,

    The Verge, “Google’s AI Search Answers Now Reach Over 1.5 Billion Users Each Month,” 2025, https://www.theverge.com/news/655930/google-q1-2025-earnings

  8. [8]

    ChatGPT Statistics (May 2025): Daily Active Users, Revenue, and More,

    DemandSage, “ChatGPT Statistics (May 2025): Daily Active Users, Revenue, and More,” 2025, https://www.demandsage.com/chatgpt-statistics/

  9. [9]

    ChatGPT Statistics 2024: Daily Users, Revenue, and Fun Facts,

    NerdyNav, “ChatGPT Statistics 2024: Daily Users, Revenue, and Fun Facts,” 2024, https://nerdynav.com/chatgpt-statistics/

  10. [10]

    ChatGPT Statistics in 2024: Users, Revenue, & Market Share,

    Master of Code Global, “ChatGPT Statistics in 2024: Users, Revenue, & Market Share,” 2024, https://masterofcode.com/blog/chatgpt-statistics

  11. [11]

    D. Z. Kádár and M. Haugh, Understanding Politeness. Cambridge, United Kingdom: Cambridge University Press, 2013

  12. [12]

    Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance,

    Z. Yin, H. Wang, K. Horio, D. Kawahara, and S. Sekine, “Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance,” in Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024). Association for Computational Linguistics, November 2024, pp. 9–35

  13. [13]

    Universals in Language Usage: Politeness Phenomena,

    P. Brown and S. C. Levinson, “Universals in Language Usage: Politeness Phenomena,” Questions and Politeness: Strategies in Social Interaction, pp. 56–289, 1978, in: Esther N. Goody (Ed.), Cambridge University Press

  14. [14]

    Politeness: Some Universals in Language Usage,

    P. Brown and S. C. Levinson, Politeness: Some Universals in Language Usage. Cambridge University Press, 1987

  15. [15]

    Towards an Anatomy of Impoliteness,

    J. Culpeper, “Towards an Anatomy of Impoliteness,” Journal of Pragmatics, vol. 25, no. 3, pp. 349–367, 1996

  16. [16]

    Impoliteness: Using Language to Cause Offence,

    J. Culpeper, Impoliteness: Using Language to Cause Offence, ser. Studies in Interactional Sociolinguistics. Cambridge and New York: Cambridge University Press, 2011, reviewed by Vitaly Voinov, SIL Electronic Book Reviews 2012-003

  17. [17]

    The Ethical Implications of Artificial Intelligence (AI) for Meaningful Work,

    S. Bankins and P. Formosa, “The Ethical Implications of Artificial Intelligence (AI) for Meaningful Work,” Journal of Business Ethics,

  18. [18]

    Available: https://doi.org/10.1007/s10551-023-05339-7

    [Online]. Available: https://doi.org/10.1007/s10551-023-05339-7

  19. [19]

    Ethics of AI in Radiology: A Review of Ethical and Societal Implications,

    M. Goisauf and M. C. Abadía, “Ethics of AI in Radiology: A Review of Ethical and Societal Implications,” Frontiers in Big Data, vol. 5, p. 850383, 2022, published: 14 July 2022. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fdata.2022.850383/full

  20. [20]

    How Rude Are You?: Evaluating Politeness and Affect in Interaction,

    S. Gupta, M. A. Walker, and D. M. Romano, “How Rude Are You?: Evaluating Politeness and Affect in Interaction,” in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII). Sheffield, UK: University of Sheffield, 2007

  21. [21]

    A computational approach to politeness with application to social factors,

    C. Danescu-Niculescu-Mizil, M. Sudhof, D. Jurafsky, J. Leskovec, and C. Potts, “A computational approach to politeness with application to social factors,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2013, pp. 250–259

  22. [22]

    Interpreting Neural Networks to Improve Politeness Comprehension,

    M. Aubakirova and M. Bansal, “Interpreting Neural Networks to Improve Politeness Comprehension,” arXiv preprint arXiv:1610.02683, 2016. Available at: https://arxiv.org/abs/1610.02683

  23. [23]

    If You Ask Nicely: A Digital Assistant Rebuking Impolite Voice Commands,

    M. Bonfert, M. Spliethöver, R. Arzaroli, M. Lange, M. Hanci, and R. Porzel, “If You Ask Nicely: A Digital Assistant Rebuking Impolite Voice Commands,” in Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI ’18). Boulder, CO, USA: ACM, 2018. Available at: https://doi.org/10.1145/3242969.3242995

  24. [24]

    Polite Or Direct? Conversation Design Of A Smart Display For Older Adults Based On Politeness Theory,

    Y. Hu, Y. Qu, A. Maus, and B. Mutlu, “Polite Or Direct? Conversation Design Of A Smart Display For Older Adults Based On Politeness Theory,” in Proceedings Of The 2022 CHI Conference On Human Factors In Computing Systems (CHI ’22). New Orleans, LA, USA: ACM, 2022. Available at: https://doi.org/10.1145/3491102.3517525

  25. [25]

    Human-Robot Conflict Resolution at an Elevator: The Effect of Robot Type, Request Politeness and Modality,

    F. Babel, P. Hock, J. Kraus, and M. Baumann, “Human-Robot Conflict Resolution at an Elevator: The Effect of Robot Type, Request Politeness and Modality,” in Proceedings of the 2022 17th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2022. Available at: https://doi.org/10.1109/HRI53351.2022.9889387

  26. [26]

    Do you say “please” to ChatGPT? Investigation of user perceptions of AI consciousness and their politeness performance in human-chatbot conversations,

    P. Hu, L. Wang, D. Chang, Z. Zhu, Y. Chen, Z. Ren, and H. Dumayi, “Do you say “please” to ChatGPT? Investigation of user perceptions of AI consciousness and their politeness performance in human-chatbot conversations,” International Journal of Human-Computer Interaction, 2026, published online 09 Jan 2026

  27. [27]

    Mind your manners: The dynamics of politeness in human-AI vs. human-human interactions,

    T. Lazebnik, L. Zalmanson, and O. Mokryn, “Mind your manners: The dynamics of politeness in human-AI vs. human-human interactions,” Proceedings of the ACM on Human-Computer Interaction, vol. 9, no. 7, pp. 1–22, 2025

  28. [28]

    The Role Of Politeness In Human–Machine Interactions: A Systematic Literature Review And Future Perspectives,

    P. Ribino, “The Role Of Politeness In Human–Machine Interactions: A Systematic Literature Review And Future Perspectives,” Artificial Intelligence Review, vol. 56, no. S1, pp. S445–S482, 2023. Available at: https://doi.org/10.1007/s10462-023-10540-1

  29. [29]

    Being Polite: Modeling Politeness Variation in a Personalized Dialog Agent,

    M. Firdaus, A. Shandilya, A. Ekbal, and P. Bhattacharyya, “Being Polite: Modeling Politeness Variation in a Personalized Dialog Agent,” IEEE Transactions on Computational Social Systems, vol. 10, 2023

  30. [30]

    How Well Can Language Models Understand Politeness?

    C. Li, B. Pang, W. Wang, L. Hu, M. Gordon, D. Marinova, B. Balducci, and Y. Shang, “How Well Can Language Models Understand Politeness?” in 2023 IEEE Conference on Artificial Intelligence (CAI), 2023, pp. 230–231

  31. [31]

    Can ChatGPT Recognize Impoliteness? An Exploratory Study of the Pragmatic Awareness of a Large Language Model,

    M. Andersson and D. McIntyre, “Can ChatGPT Recognize Impoliteness? An Exploratory Study of the Pragmatic Awareness of a Large Language Model,” Journal of Pragmatics, vol. 239, 2025. [Online]. Available: https://doi.org/10.1016/j.pragma.2025.02.001

  32. [32]

    Human–Computer Pragmatics Trialled: Some (Im)Polite Interactions With ChatGPT 4.0 And The Ensuing Implications,

    Z. Quan and Z. Chen, “Human–Computer Pragmatics Trialled: Some (Im)Polite Interactions With ChatGPT 4.0 And The Ensuing Implications,” Interactive Learning Environments, vol. 33, no. 2, pp. 1020–1039, 2024. Available at: https://doi.org/10.1080/10494820.2024.2362829

  33. [33]

    Cooking up politeness in human-AI information seeking dialogue,

    D. Elsweiler, C. Elsweiler, and A. Ziegner, “Cooking up politeness in human-AI information seeking dialogue,” in Proceedings of the 2026 ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR ’26). Seattle, WA, USA: ACM, 2026

  34. [34]

    The Impact of Question Styles on Response Characteristics in Dialogue with Generative AI: A Comparative Analysis of Polite and Direct Communication Patterns,

    K. Sato, “The Impact of Question Styles on Response Characteristics in Dialogue with Generative AI: A Comparative Analysis of Polite and Direct Communication Patterns,” Jxiv Preprints, 2024, preprint

  35. [35]

    The influence of prompt politeness on response quality in large language models,

    T. Zarra and R. Chiheb, “The influence of prompt politeness on response quality in large language models,” in 2025 International Conference on Circuit, Systems and Communication (ICCSC). IEEE, 2025

  36. [36]

    Many Faces Of A Chatbot: The Use Of Positive And Negative Politeness Strategies In Argumentative Communication With A Chatbot,

    G. Ivković, “Many Faces Of A Chatbot: The Use Of Positive And Negative Politeness Strategies In Argumentative Communication With A Chatbot,” Journal of Language and Literary Studies, no. 49, pp. 157–176, 2024. Available at: https://doi.org/10.31902/fll.49.2024.9

  37. [37]

    Comparing human and LLM politeness strategies in free production,

    H. Zhao and R. D. Hawkins, “Comparing human and LLM politeness strategies in free production,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2025, pp. 16188–16216

  38. [38]

    Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language We Prompt Them In,

    U. Agarwal, K. Tanmay, A. Khandelwal, and M. Choudhury, “Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language We Prompt Them In,” arXiv Preprints, 2024. Available at: https://arxiv.org/pdf/2404.18460

  39. [39]

    Exploring Inconsistencies in AI Language Generation: A Comprehensive Case Study of ChatGPT,

    K. Allan, “Exploring Inconsistencies in AI Language Generation: A Comprehensive Case Study of ChatGPT,” 2023

  40. [40]

    An In-depth Look at Gemini’s Language Abilities,

    S. N. Akter, Z. Yu, A. Muhamed, T. Ou, A. Bäuerle, Ángel Alexander Cabrera, K. Dholakia, C. Xiong, and G. Neubig, “An In-depth Look at Gemini’s Language Abilities,” arXiv Preprint, 2023. Available at: https://arxiv.org/abs/2312.11444

  41. [41]

    Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects,

    A. Rahman, S. H. Mahir, M. T. A. Tashrif, A. A. Aishi, M. A. Karim, D. Kundu, T. Debnath, M. A. A. Moududi, and M. Z. A. Eidmum, “Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects,” arXiv Preprint, 2025. Available at: https://arxiv.org/abs/2503.04783

  42. [42]

    Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond,

    J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 4, pp. 160:1–160:35, 2024

  43. [43]

    A Comparison of DeepSeek and Other LLMs,

    T. Gao, J. Jin, Z. T. Ke, and G. Moryoussef, “A Comparison of DeepSeek and Other LLMs,” arXiv Preprint, 2025. Available at: https://arxiv.org/abs/2502.03688

  44. [44]

    Gemini Models,

    Google DeepMind, “Gemini Models,” https://deepmind.google/technologies/gemini/, 2024, accessed: October 2025

  45. [45]

    GPT-4o and GPT-4o Mini Models,

    OpenAI, “GPT-4o and GPT-4o Mini Models,” https://openai.com/research/gpt-4o, 2024, accessed: October 2025

  46. [46]

    Claude 3.7 Sonnet Models,

    Anthropic, “Claude 3.7 Sonnet Models,” https://www.anthropic.com/claude, 2025, accessed: October 2025

  47. [47]

    DeepSeek-Chat Language Model,

    DeepSeek AI, “DeepSeek-Chat Language Model,” https://www.deepseek.com/, 2025, accessed: October 2025

  48. [48]

    Llama 3 Models,

    Meta AI, “Llama 3 Models,” https://ai.meta.com/llama/, 2024, accessed: October 2025

  49. [49]

    Statistical Power Analysis for the Behavioral Sciences, 2nd ed.

    J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Lawrence Erlbaum Associates, 1988.

APPENDIX A
SUPPLEMENTARY: FORMAL HYPOTHESES, AXIOMS, AND EMPIRICAL PROOFS

This section formalises the theoretical claims implicit in the experimental design and evaluates each against the empirical evidence gathered across 22,500 (1,500 prompts across ea...