Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Adnan Qayyum; Junaid Qadir; Mahdi Alkaeed; Muhammad Bilal; Nabeel Abo Kashreef

arxiv: 2605.30646 · v1 · pith:F3TOXWULnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Mahdi Alkaeed , Adnan Qayyum , Nabeel Abo Kashreef , Muhammad Bilal , Junaid Qadir This is my paper

Pith reviewed 2026-06-29 07:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords clinical LLMssemantic stabilityprompt robustnessdomain-specific modelsnatural language inferencemedical question answeringLLM evaluation

0 comments

The pith

Domain specialization does not consistently improve clinical LLMs' robustness to meaning-preserving prompt changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework using natural language inference to verify that prompt variations truly preserve clinical meaning, addressing limitations of embedding similarity. It then measures how sensitive different LLMs are to such variations using new metrics on medical question-answering tasks. Results show that medical domain-specific models do not reliably outperform general models in producing consistent outputs for equivalent inputs. This matters because in healthcare, small wording changes should not alter diagnoses or recommendations. The evaluation covers 16 models from the same families.

Core claim

Using a semantic verification framework based on Natural Language Inference, refined by LLM-as-a-judge and clinical expert audit, the authors evaluate 16 open-source general-purpose and medical LLMs on reformulated prompts from DiagnosisQA and MedQA. They find that robustness differences between domain-specific models are mixed and highly model-dependent, meaning domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several domain-specific models rank among the most robust compared to their general-purpose counterparts, while strong general baselines remain competitive.

What carries the argument

NLI-based semantic verification framework that filters meaning-preserving prompt variations, combined with metrics MeaningPreserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI).

If this is right

Domain specialization in medical LLMs does not guarantee greater stability to rephrased inputs.
General-purpose LLMs can compete with or exceed domain-specific ones in semantic robustness.
Meaning-preserving variations must be carefully verified to avoid missing distinctions like negation or severity.
Model selection for clinical use should include robustness testing beyond accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The findings imply that factors beyond domain labels, such as specific training data or architecture choices, drive robustness.
Applying the verification framework to other domains like legal or financial LLMs could reveal similar patterns.
Real-world deployment might require ongoing monitoring of prompt sensitivity in live clinical settings.

Load-bearing premise

The NLI-based semantic verification framework, refined by an LLM-as-a-judge and audited by a clinical expert, correctly identifies meaning-preserving prompt variations without missing clinically important distinctions such as negation, temporality, or severity.

What would settle it

A clinical expert review finding that two prompts labeled as meaning-preserving by the framework actually differ in a key clinical detail like negation or severity, causing the model to change its diagnosis.

Figures

Figures reproduced from arXiv: 2605.30646 by Adnan Qayyum, Junaid Qadir, Mahdi Alkaeed, Muhammad Bilal, Nabeel Abo Kashreef.

**Figure 1.** Figure 1: Overview of the systematic framework for evaluating semantic stability in clinical LLMs through meaning-preserving [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Number of variations classified as semantically equivalent ( [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Robustness analysis across meaning-preserving variations for GP and DS LLMs. The heatmaps report the mean MVS [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Relationship between prediction instability and confidence instability under meaning-preserving prompt reformulations [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Confidence–accuracy alignment across various GP and DS models on DiagnosisQA and MedQA datasets. The dashed [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: MeaningPreserving Variation Sensitivity (MVS), confidence variation (\Delta C), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper finds domain specialization gives no consistent robustness gain against meaning-preserving rephrasings in clinical LLMs, using a new NLI-plus-judge pipeline and three metrics, but the equivalence filter lacks reported validation.

read the letter

The one thing to know is that this work tests 16 open-source LLMs on DiagnosisQA and MedQA reformulations and concludes that domain-specific models do not reliably outperform general ones on stability to semantically equivalent prompts. The second is that they built a verification stack (NLI filter, LLM-as-judge, single-expert audit) and three metrics (MVS, ΔC, WCI) to make that test possible.

The pipeline is a clear step past embedding similarity, which the abstract correctly notes fails on negation, temporality, and severity. Defining the three metrics directly from model outputs on verified pairs is straightforward and reproducible in principle. Evaluating matched GP and DS models at similar scales is also the right control.

The soft spot is exactly the one the stress-test flags: no numbers on how often the NLI+LLM+expert step keeps clinically non-equivalent pairs. The abstract mentions the framework but supplies no inter-rater numbers, no held-out error analysis on negation cases, and no false-positive rate. If even a modest share of retained pairs differ on severity or timing, the mixed robustness result could be an artifact. The abstract also gives the headline conclusion without any actual metric values or confidence intervals, so the claim is hard to assess from the given text.

This is for people working on safe clinical LLM deployment or prompt robustness. A reader who needs a concrete way to measure meaning-preserving variation will find usable pieces even if the validation is thin. It deserves peer review because the framing and metrics address a real deployment risk and the evaluation design is sensible; the verification gap is fixable with added checks rather than fatal.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a semantic verification framework that combines NLI filtering, an LLM-as-a-judge, and single-expert clinical audit to retain meaning-preserving prompt reformulations from DiagnosisQA and MedQA. It defines three sensitivity metrics (MVS, ΔC, WCI) and evaluates 16 open-source general-purpose and domain-specific LLMs from matched families and scales, reporting that robustness differences are mixed and model-dependent, so that domain specialization does not consistently improve or reduce stability to semantically equivalent rephrasings.

Significance. If the retained pairs are verifiably meaning-preserving at the level of clinical inference, the mixed DS/GP pattern would indicate that medical fine-tuning alone does not reliably mitigate prompt-sensitivity risks in safety-critical settings. The use of public datasets, same-family controls, and standard NLI tools supports reproducibility and allows direct comparison of specialization effects.

major comments (1)

[Methods (semantic verification framework)] Methods (semantic verification framework): the NLI + LLM-judge + expert-audit pipeline is described but supplies no quantitative validation—no inter-rater agreement on the expert audit, no error analysis or false-positive rates on held-out negation/temporality/severity cases, and no reported performance of the NLI+LLM filter on clinical text. Because the central claim rests on the retained pairs being clinically equivalent, the absence of these checks leaves open the possibility that observed robustness differences are artifacts of inconsistent labels rather than model behavior.

minor comments (2)

[Abstract] Abstract: reports the high-level conclusion and method but contains no numerical results, key statistics, or dataset sizes, making it difficult for readers to gauge effect magnitude without reading the full text.
[Results] Results section: when presenting the mixed DS/GP ranking, include a table that explicitly pairs each DS model with its GP counterpart (same family and scale) so that the model-dependent claim can be inspected at a glance.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will make revisions to strengthen the validation of the semantic verification framework.

read point-by-point responses

Referee: Methods (semantic verification framework): the NLI + LLM-judge + expert-audit pipeline is described but supplies no quantitative validation—no inter-rater agreement on the expert audit, no error analysis or false-positive rates on held-out negation/temporality/severity cases, and no reported performance of the NLI+LLM filter on clinical text. Because the central claim rests on the retained pairs being clinically equivalent, the absence of these checks leaves open the possibility that observed robustness differences are artifacts of inconsistent labels rather than model behavior.

Authors: We agree that quantitative validation of the NLI + LLM-judge + expert-audit pipeline is necessary to substantiate that retained pairs are clinically equivalent. In the revised manuscript we will add: (1) performance metrics (e.g., accuracy or F1) of the NLI+LLM filter evaluated on a held-out set of clinical examples that explicitly cover negation, temporality, and severity; (2) an error analysis that reports false-positive rates for meaning-preserving classifications; and (3) an expanded description of the single-expert audit protocol together with an explicit discussion of this design choice as a study limitation. These additions will directly address the concern that robustness differences could be artifacts of label inconsistency. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metrics defined directly from outputs on public data

full rationale

The paper applies standard NLI tools and an LLM judge to public DiagnosisQA/MedQA data to filter prompt pairs, then computes three new metrics (MVS, ΔC, WCI) directly from the resulting model outputs. No parameters are fitted to the target robustness conclusion, no self-citation chain supports the central empirical claim, and the reported mixed DS/GP robustness pattern is an observation rather than a definitional or fitted consequence. The framework is proposed and executed; results remain falsifiable against external clinical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that NLI models can reliably detect clinical equivalence; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Embedding-based similarity metrics fail to capture distinctions involving negation, temporality, or severity
Stated as the key limitation that motivates the NLI approach.
domain assumption NLI models plus LLM-as-judge plus clinical expert can accurately filter meaning-preserving variations
Central premise of the proposed verification framework.

pith-pipeline@v0.9.1-grok · 5796 in / 1301 out tokens · 41061 ms · 2026-06-29T07:12:19.665061+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Application of large language models in medicine,

F. Liu, H. Zhou, B. Gu, X. Zou, J. Huang, J. Wu, Y . Li, S. S. Chen, Y . Hua, P. Zhouet al., “Application of large language models in medicine,”Nature Reviews Bioengineering, vol. 3, no. 6, pp. 445–464, 2025

2025
[2]

The reliability of LLMs for medical diagnosis: An exam- ination of consistency, manipulation, and contextual awareness,

K. Subedi, “The reliability of LLMs for medical diagnosis: An exam- ination of consistency, manipulation, and contextual awareness,”arXiv preprint arXiv:2503.10647, 2025

work page arXiv 2025
[3]

Large language models encode clinical knowledge,

K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohlet al., “Large language models encode clinical knowledge,”Nature, vol. 620, no. 7972, pp. 172– 180, 2023

2023
[4]

Toward expert- level medical question answering with large language models,

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewiset al., “Toward expert- level medical question answering with large language models,”Nature medicine, vol. 31, no. 3, pp. 943–950, 2025

2025
[5]

Llm sensitivity evaluation framework for clinical diagnosis,

C. Yan, X. Fu, Y . Xiong, T. Wang, S. C. Hui, J. Wu, and X. Liu, “Llm sensitivity evaluation framework for clinical diagnosis,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 3083–3094

2025
[6]

Evaluation and mitigation of the limitations of large language models in clinical decision-making,

P. Hager, F. Jungmann, R. Holland, K. Bhagat, I. Hubrecht, M. Knauer, J. Vielhauer, M. Makowski, R. Braren, G. Kaissis, and D. Rueckert, “Evaluation and mitigation of the limitations of large language models in clinical decision-making,”Nature Medicine, vol. 30, no. 9, pp. 2613– 2622, 2024

2024
[7]

On the worst prompt performance of large language models,

B. Cao, D. Cai, Z. Zhang, Y . Zou, and W. Lam, “On the worst prompt performance of large language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 69 022–69 042, 2024

2024
[8]

Semantically equivalent adversarial rules for debugging NLP models,

M. T. Ribeiro, S. Singh, and C. Guestrin, “Semantically equivalent adversarial rules for debugging NLP models,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 856–865

2018
[9]

Is BERT really robust? a strong baseline for natural language attack on text classification and entailment,

D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is BERT really robust? a strong baseline for natural language attack on text classification and entailment,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8018–8025

2020
[10]

Reevaluating adversarial examples in natural language,

J. X. Morris, E. Lifland, J. Lanchantin, Y . Ji, and Y . Qi, “Reevaluating adversarial examples in natural language,” inFindings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020, 2020

2020
[11]

Tailor: Generating and perturbing text with semantic controls,

A. Ross, T. Wu, H. Peng, M. E. Peters, and M. Gardner, “Tailor: Generating and perturbing text with semantic controls,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 3194–3213. 14

2022
[12]

Beyond accuracy: Behavioral testing of NLP models with CheckList,

M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, “Beyond accuracy: Behavioral testing of NLP models with CheckList,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 4902– 4912

2020
[13]

Adaptive testing and debugging of NLP models,

M. T. Ribeiro and S. Lundberg, “Adaptive testing and debugging of NLP models,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 3253– 3267

2022
[14]

Polyjuice: Generat- ing counterfactuals for explaining, evaluating, and improving models,

T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld, “Polyjuice: Generat- ing counterfactuals for explaining, evaluating, and improving models,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Comp...

2021
[15]

Paws: Paraphrase adversaries from word scrambling,

Y . Zhang, J. Baldridge, and L. He, “Paws: Paraphrase adversaries from word scrambling,” inNAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, 2019

2019
[16]

Measuring and improving consistency in pretrained language models,

Y . Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Sch¨utze, and Y . Goldberg, “Measuring and improving consistency in pretrained language models,”Transactions of the Association for Computational Linguistics, vol. 9, pp. 1012–1031, 2021

2021
[17]

Evaluating paraphrastic robustness in textual entailment models,

D. Verma and A. Poliak, “Evaluating paraphrastic robustness in textual entailment models,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop. Online: Association for Computational Linguistics, 2021, pp. 351–358

2021
[18]

Paraphrasus: A comprehensive benchmark for evaluating paraphrase detection models,

A. Michail, S. Clematide, and J. Opitz, “Paraphrasus: A comprehensive benchmark for evaluating paraphrase detection models,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 8749–8762

2025
[19]

MENLI: Robust evaluation metrics from natural language inference,

Y . Chen and S. Eger, “MENLI: Robust evaluation metrics from natural language inference,”Transactions of the Association for Computational Linguistics, vol. 11, pp. 804–825, 2023

2023
[20]

SummaC: Re-visiting NLI-based models for inconsistency detection in summa- rization,

P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst, “SummaC: Re-visiting NLI-based models for inconsistency detection in summa- rization,”Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022

2022
[21]

Enhancing self-consistency and performance of pre-trained language models through natural language inference,

E. Mitchell, J. Noh, S. Li, W. Armstrong, A. Agarwal, P. Liu, C. Finn, and C. D. Manning, “Enhancing self-consistency and performance of pre-trained language models through natural language inference,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational L...

2022
[22]

Lessons from natural language inference in the clinical domain,

A. Romanov and C. Shivade, “Lessons from natural language inference in the clinical domain,” inProceedings of the 2018 Conference on Em- pirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 1586–1596

2018
[23]

State of what art? a call for multi-prompt LLM eval- uation,

M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of what art? a call for multi-prompt LLM eval- uation,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 933–949, 2024

2024
[24]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,”arXiv preprint arXiv:2310.11324, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts,

K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y . Wang, L. Yang, W. Ye, Y . Zhang, N. Gonget al., “Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts,” in Proceedings of the 1st ACM workshop on large AI systems and models with privacy and safety analysis, 2023, pp. 57–68

2023
[26]

Limitations of large language models in clinical problem-solving aris- ing from inflexible reasoning,

J. Kim, A. Podlasek, K. Shidara, F. Liu, A. Alaa, and D. Bernardo, “Limitations of large language models in clinical problem-solving aris- ing from inflexible reasoning,”Scientific Reports, vol. 15, 2025

2025
[27]

A systematic review of large language model (llm) evaluations in clinical medicine,

S. Shool, S. Adimi, R. Saboori Amleshi, E. Bitaraf, R. Golpira, and M. Tara, “A systematic review of large language model (llm) evaluations in clinical medicine,”BMC Medical Informatics and Decision Making, vol. 25, no. 1, p. 117, 2025

2025
[28]

Testing and evaluation of health care applications of large language models: a systematic review,

S. Bedi, Y . Liu, L. Orr-Ewing, D. Dash, S. Koyejo, A. Callahan, J. A. Fries, M. Wornow, A. Swaminathan, L. S. Lehmannet al., “Testing and evaluation of health care applications of large language models: a systematic review,”Jama, vol. 333, no. 4, pp. 319–328, 2025

2025
[29]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inEMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th Interna- tional Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019

2019
[30]

Medsim: A novel seman- tic similarity measure in bio-medical knowledge graphs,

K. Lei, K. Yuan, Q. Zhang, and Y . Shen, “Medsim: A novel seman- tic similarity measure in bio-medical knowledge graphs,” inInterna- tional conference on knowledge science, engineering and management. Springer, 2018, pp. 479–490

2018
[31]

Semantics at an angle: When cosine similarity works until it doesn’t,

K. You, “Semantics at an angle: When cosine similarity works until it doesn’t,”arXiv preprint arXiv:2504.16318, 2025

work page arXiv 2025
[32]

Is cosine-similarity of embed- dings really about similarity?

H. Steck, C. Ekanadham, and N. Kallus, “Is cosine-similarity of embed- dings really about similarity?” inCompanion Proceedings of the ACM Web Conference 2024, 2024, pp. 887–890

2024
[33]

Does prompt formatting have any impact on llm performance?

J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, and S. Hasan, “Does prompt formatting have any impact on llm performance?”arXiv preprint arXiv:2411.10541, 2024

work page arXiv 2024
[34]

This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

H. S. Yun, G. Kapoor, M. Mackert, R. Kouzy, W. Xu, J. J. Li, and B. C. Wallace, “This treatment works, right? evaluating llm sensitivity to pa- tient question framing in medical qa,”arXiv preprint arXiv:2604.05051, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering,

F. Errica, D. Sanvito, G. Siracusano, and R. Bifulco, “What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 1543–1558

2025
[36]

Prosa: Assessing and understanding the prompt sensitivity of llms,

J. Zhuo, S. Zhang, X. Fang, H. Duan, D. Lin, and K. Chen, “Prosa: Assessing and understanding the prompt sensitivity of llms,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 1950–1976

2024
[37]

Promptception: How sensitive are large multimodal models to prompts?

M. I. Ismithdeen, M. U. Khattak, and S. Khan, “Promptception: How sensitive are large multimodal models to prompts?”arXiv preprint arXiv:2509.03986, 2025

work page arXiv 2025
[38]

Evaluating the zero-shot robustness of instruction-tuned language models,

J. Sun, C. Shaib, and B. C. Wallace, “Evaluating the zero-shot robustness of instruction-tuned language models,”arXiv preprint arXiv:2306.11270, 2023

work page arXiv 2023
[39]

Revisiting nli: Towards cost-effective and human-aligned metrics for evaluating llms in question answering,

S. S. Balamurali and L. Cheng, “Revisiting nli: Towards cost-effective and human-aligned metrics for evaluating llms in question answering,” arXiv preprint arXiv:2511.07659, 2025

work page arXiv 2025
[40]

Improving paraphrase detection with the adversarial paraphrasing task,

A. Nighojkar and J. Licato, “Improving paraphrase detection with the adversarial paraphrasing task,” inACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, vol. 1, 2021

2021
[41]

Improving medical nli using context-aware domain knowledge,

S. Chowdhury, S. Y . Philip, and Y . Luo, “Improving medical nli using context-aware domain knowledge,” inProceedings of the Ninth Joint Conference on Lexical and Computational Semantics, 2020, pp. 1–11

2020
[42]

A comprehensive survey on the trustworthiness of large language models in healthcare,

M. Aljohani, J. Hou, S. Kommu, and X. Wang, “A comprehensive survey on the trustworthiness of large language models in healthcare,”arXiv preprint arXiv:2502.15871, 2025

work page arXiv 2025

[1] [1]

Application of large language models in medicine,

F. Liu, H. Zhou, B. Gu, X. Zou, J. Huang, J. Wu, Y . Li, S. S. Chen, Y . Hua, P. Zhouet al., “Application of large language models in medicine,”Nature Reviews Bioengineering, vol. 3, no. 6, pp. 445–464, 2025

2025

[2] [2]

The reliability of LLMs for medical diagnosis: An exam- ination of consistency, manipulation, and contextual awareness,

K. Subedi, “The reliability of LLMs for medical diagnosis: An exam- ination of consistency, manipulation, and contextual awareness,”arXiv preprint arXiv:2503.10647, 2025

work page arXiv 2025

[3] [3]

Large language models encode clinical knowledge,

K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohlet al., “Large language models encode clinical knowledge,”Nature, vol. 620, no. 7972, pp. 172– 180, 2023

2023

[4] [4]

Toward expert- level medical question answering with large language models,

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewiset al., “Toward expert- level medical question answering with large language models,”Nature medicine, vol. 31, no. 3, pp. 943–950, 2025

2025

[5] [5]

Llm sensitivity evaluation framework for clinical diagnosis,

C. Yan, X. Fu, Y . Xiong, T. Wang, S. C. Hui, J. Wu, and X. Liu, “Llm sensitivity evaluation framework for clinical diagnosis,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 3083–3094

2025

[6] [6]

Evaluation and mitigation of the limitations of large language models in clinical decision-making,

P. Hager, F. Jungmann, R. Holland, K. Bhagat, I. Hubrecht, M. Knauer, J. Vielhauer, M. Makowski, R. Braren, G. Kaissis, and D. Rueckert, “Evaluation and mitigation of the limitations of large language models in clinical decision-making,”Nature Medicine, vol. 30, no. 9, pp. 2613– 2622, 2024

2024

[7] [7]

On the worst prompt performance of large language models,

B. Cao, D. Cai, Z. Zhang, Y . Zou, and W. Lam, “On the worst prompt performance of large language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 69 022–69 042, 2024

2024

[8] [8]

Semantically equivalent adversarial rules for debugging NLP models,

M. T. Ribeiro, S. Singh, and C. Guestrin, “Semantically equivalent adversarial rules for debugging NLP models,” inProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 856–865

2018

[9] [9]

Is BERT really robust? a strong baseline for natural language attack on text classification and entailment,

D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is BERT really robust? a strong baseline for natural language attack on text classification and entailment,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8018–8025

2020

[10] [10]

Reevaluating adversarial examples in natural language,

J. X. Morris, E. Lifland, J. Lanchantin, Y . Ji, and Y . Qi, “Reevaluating adversarial examples in natural language,” inFindings of the Association for Computational Linguistics Findings of ACL: EMNLP 2020, 2020

2020

[11] [11]

Tailor: Generating and perturbing text with semantic controls,

A. Ross, T. Wu, H. Peng, M. E. Peters, and M. Gardner, “Tailor: Generating and perturbing text with semantic controls,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 3194–3213. 14

2022

[12] [12]

Beyond accuracy: Behavioral testing of NLP models with CheckList,

M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, “Beyond accuracy: Behavioral testing of NLP models with CheckList,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 4902– 4912

2020

[13] [13]

Adaptive testing and debugging of NLP models,

M. T. Ribeiro and S. Lundberg, “Adaptive testing and debugging of NLP models,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 3253– 3267

2022

[14] [14]

Polyjuice: Generat- ing counterfactuals for explaining, evaluating, and improving models,

T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld, “Polyjuice: Generat- ing counterfactuals for explaining, evaluating, and improving models,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Comp...

2021

[15] [15]

Paws: Paraphrase adversaries from word scrambling,

Y . Zhang, J. Baldridge, and L. He, “Paws: Paraphrase adversaries from word scrambling,” inNAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 1, 2019

2019

[16] [16]

Measuring and improving consistency in pretrained language models,

Y . Elazar, N. Kassner, S. Ravfogel, A. Ravichander, E. Hovy, H. Sch¨utze, and Y . Goldberg, “Measuring and improving consistency in pretrained language models,”Transactions of the Association for Computational Linguistics, vol. 9, pp. 1012–1031, 2021

2021

[17] [17]

Evaluating paraphrastic robustness in textual entailment models,

D. Verma and A. Poliak, “Evaluating paraphrastic robustness in textual entailment models,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop. Online: Association for Computational Linguistics, 2021, pp. 351–358

2021

[18] [18]

Paraphrasus: A comprehensive benchmark for evaluating paraphrase detection models,

A. Michail, S. Clematide, and J. Opitz, “Paraphrasus: A comprehensive benchmark for evaluating paraphrase detection models,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 8749–8762

2025

[19] [19]

MENLI: Robust evaluation metrics from natural language inference,

Y . Chen and S. Eger, “MENLI: Robust evaluation metrics from natural language inference,”Transactions of the Association for Computational Linguistics, vol. 11, pp. 804–825, 2023

2023

[20] [20]

SummaC: Re-visiting NLI-based models for inconsistency detection in summa- rization,

P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst, “SummaC: Re-visiting NLI-based models for inconsistency detection in summa- rization,”Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022

2022

[21] [21]

Enhancing self-consistency and performance of pre-trained language models through natural language inference,

E. Mitchell, J. Noh, S. Li, W. Armstrong, A. Agarwal, P. Liu, C. Finn, and C. D. Manning, “Enhancing self-consistency and performance of pre-trained language models through natural language inference,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational L...

2022

[22] [22]

Lessons from natural language inference in the clinical domain,

A. Romanov and C. Shivade, “Lessons from natural language inference in the clinical domain,” inProceedings of the 2018 Conference on Em- pirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 1586–1596

2018

[23] [23]

State of what art? a call for multi-prompt LLM eval- uation,

M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of what art? a call for multi-prompt LLM eval- uation,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 933–949, 2024

2024

[24] [24]

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,”arXiv preprint arXiv:2310.11324, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts,

K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y . Wang, L. Yang, W. Ye, Y . Zhang, N. Gonget al., “Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts,” in Proceedings of the 1st ACM workshop on large AI systems and models with privacy and safety analysis, 2023, pp. 57–68

2023

[26] [26]

Limitations of large language models in clinical problem-solving aris- ing from inflexible reasoning,

J. Kim, A. Podlasek, K. Shidara, F. Liu, A. Alaa, and D. Bernardo, “Limitations of large language models in clinical problem-solving aris- ing from inflexible reasoning,”Scientific Reports, vol. 15, 2025

2025

[27] [27]

A systematic review of large language model (llm) evaluations in clinical medicine,

S. Shool, S. Adimi, R. Saboori Amleshi, E. Bitaraf, R. Golpira, and M. Tara, “A systematic review of large language model (llm) evaluations in clinical medicine,”BMC Medical Informatics and Decision Making, vol. 25, no. 1, p. 117, 2025

2025

[28] [28]

Testing and evaluation of health care applications of large language models: a systematic review,

S. Bedi, Y . Liu, L. Orr-Ewing, D. Dash, S. Koyejo, A. Callahan, J. A. Fries, M. Wornow, A. Swaminathan, L. S. Lehmannet al., “Testing and evaluation of health care applications of large language models: a systematic review,”Jama, vol. 333, no. 4, pp. 319–328, 2025

2025

[29] [29]

Sentence-bert: Sentence embeddings using siamese bert-networks,

N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” inEMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th Interna- tional Joint Conference on Natural Language Processing, Proceedings of the Conference, 2019

2019

[30] [30]

Medsim: A novel seman- tic similarity measure in bio-medical knowledge graphs,

K. Lei, K. Yuan, Q. Zhang, and Y . Shen, “Medsim: A novel seman- tic similarity measure in bio-medical knowledge graphs,” inInterna- tional conference on knowledge science, engineering and management. Springer, 2018, pp. 479–490

2018

[31] [31]

Semantics at an angle: When cosine similarity works until it doesn’t,

K. You, “Semantics at an angle: When cosine similarity works until it doesn’t,”arXiv preprint arXiv:2504.16318, 2025

work page arXiv 2025

[32] [32]

Is cosine-similarity of embed- dings really about similarity?

H. Steck, C. Ekanadham, and N. Kallus, “Is cosine-similarity of embed- dings really about similarity?” inCompanion Proceedings of the ACM Web Conference 2024, 2024, pp. 887–890

2024

[33] [33]

Does prompt formatting have any impact on llm performance?

J. He, M. Rungta, D. Koleczek, A. Sekhon, F. X. Wang, and S. Hasan, “Does prompt formatting have any impact on llm performance?”arXiv preprint arXiv:2411.10541, 2024

work page arXiv 2024

[34] [34]

This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

H. S. Yun, G. Kapoor, M. Mackert, R. Kouzy, W. Xu, J. J. Li, and B. C. Wallace, “This treatment works, right? evaluating llm sensitivity to pa- tient question framing in medical qa,”arXiv preprint arXiv:2604.05051, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering,

F. Errica, D. Sanvito, G. Siracusano, and R. Bifulco, “What did i do wrong? quantifying llms’ sensitivity and consistency to prompt engineering,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 1543–1558

2025

[36] [36]

Prosa: Assessing and understanding the prompt sensitivity of llms,

J. Zhuo, S. Zhang, X. Fang, H. Duan, D. Lin, and K. Chen, “Prosa: Assessing and understanding the prompt sensitivity of llms,” inFindings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 1950–1976

2024

[37] [37]

Promptception: How sensitive are large multimodal models to prompts?

M. I. Ismithdeen, M. U. Khattak, and S. Khan, “Promptception: How sensitive are large multimodal models to prompts?”arXiv preprint arXiv:2509.03986, 2025

work page arXiv 2025

[38] [38]

Evaluating the zero-shot robustness of instruction-tuned language models,

J. Sun, C. Shaib, and B. C. Wallace, “Evaluating the zero-shot robustness of instruction-tuned language models,”arXiv preprint arXiv:2306.11270, 2023

work page arXiv 2023

[39] [39]

Revisiting nli: Towards cost-effective and human-aligned metrics for evaluating llms in question answering,

S. S. Balamurali and L. Cheng, “Revisiting nli: Towards cost-effective and human-aligned metrics for evaluating llms in question answering,” arXiv preprint arXiv:2511.07659, 2025

work page arXiv 2025

[40] [40]

Improving paraphrase detection with the adversarial paraphrasing task,

A. Nighojkar and J. Licato, “Improving paraphrase detection with the adversarial paraphrasing task,” inACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, vol. 1, 2021

2021

[41] [41]

Improving medical nli using context-aware domain knowledge,

S. Chowdhury, S. Y . Philip, and Y . Luo, “Improving medical nli using context-aware domain knowledge,” inProceedings of the Ninth Joint Conference on Lexical and Computational Semantics, 2020, pp. 1–11

2020

[42] [42]

A comprehensive survey on the trustworthiness of large language models in healthcare,

M. Aljohani, J. Hou, S. Kommu, and X. Wang, “A comprehensive survey on the trustworthiness of large language models in healthcare,”arXiv preprint arXiv:2502.15871, 2025

work page arXiv 2025