This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

Byron C. Wallace; Geetika Kapoor; Hye Sun Yun; Junyi Jessy Li; Michael Mackert; Ramez Kouzy; Wei Xu

arxiv: 2604.05051 · v1 · submitted 2026-04-06 · 💻 cs.CL · cs.AI

This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA

Hye Sun Yun , Geetika Kapoor , Michael Mackert , Ramez Kouzy , Wei Xu , Junyi Jessy Li , Byron C. Wallace This is my paper

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords LLM sensitivitymedical QAquestion framingRAGresponse consistencymulti-turn conversationsclinical trial abstracts

0 comments

The pith

Large language models produce contradictory medical conclusions when patients phrase the same question positively versus negatively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs respond consistently to medical questions when the underlying evidence stays the same but the question wording changes. It compares pairs of queries that differ only in whether they are framed positively or negatively, using the same expert-selected clinical trial abstracts for both. Positively and negatively framed pairs produce significantly more contradictory conclusions than pairs that keep the same framing. The inconsistency grows larger when the conversation continues over multiple turns. The study also checks technical versus plain language but finds no interaction with framing.

Core claim

In a controlled retrieval-augmented generation setting for medical question answering, where every query pair is grounded in identical expert-selected clinical trial abstracts, positively-framed and negatively-framed questions lead to significantly higher rates of contradictory conclusions than same-framing pairs, and this framing effect strengthens across multi-turn conversations.

What carries the argument

Paired evaluation of LLM responses to positive versus negative framings of the same medical question, with contradiction measured against the fixed evidence in the provided documents.

Load-bearing premise

The expert-selected documents contain truly identical evidence for both positive and negative framings, and any contradictions can be attributed to framing rather than other prompt or model factors.

What would settle it

A new experiment that shows no significant difference in contradiction rates between positive-negative pairs and same-framing pairs when the same models are tested on a fresh collection of clinical documents.

Figures

Figures reproduced from arXiv: 2604.05051 by Byron C. Wallace, Geetika Kapoor, Hye Sun Yun, Junyi Jessy Li, Michael Mackert, Ramez Kouzy, Wei Xu.

**Figure 2.** Figure 2: Across all models evaluated, we observe the evidence direction agreement be [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Odds ratios (95% CI) from logistic regression models estimating the susceptibility [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: The differences in the evidence agreement rates between [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Effects of framing and style on evidence direction agreement. The agreement rate of Framed questions was lower than the Baseline for both technical and plain language styles. No interaction effect was observed. We selected two of the 12 types of questions to use with the simplified treatment– condition pair that were simple and straightforward. The selected question types were: “How effective is X for Y?… view at source ↗

**Figure 6.** Figure 6: The reviews in our final dataset (N = 368) were published between 1998 and 2019. The most frequent publication year is 2012, followed by 2013. 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 19 20 21 22 33 Number of Included Studies 0 20 40 60 80 100 Number of Reviews [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Distribution of reviews (N = 368) based on the number of included clinical studies. The average number of clinical studies associated with each review is 4.78. primarily standardized common lay expressions observed in consumer health queries from HealthSearchQA (Singhal et al., 2023) and HealthChat-11K (Paruchuri et al., 2025). Examples include using “MS” for multiple sclerosis and replacing “chronic” with… view at source ↗

**Figure 8.** Figure 8: Distribution of reviews (N = 368) based on 14 medical condition categories. “Neurology & Pain” is the most common condition category found in our dataset. However, the simplified intervention and condition terms must still be clinically accurate representations of the original intervention and condition, in the context of the Cochrane review. Keep in mind that the new simplified intervention and condition … view at source ↗

**Figure 9.** Figure 9: Distribution of average medical jargon scores per review for technical and plain [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Differences in evidence agreement rates between the [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗

**Figure 11.** Figure 11: Odds ratios (95% CI) from logistic regression models estimating the susceptibility [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Differences in evidence agreement rates between the [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Differences in evidence agreement rates between the [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Differences in evidence agreement rates between the [PITH_FULL_IMAGE:figures/full_fig_p028_14.png] view at source ↗

**Figure 15.** Figure 15: Differences in evidence agreement rates between the [PITH_FULL_IMAGE:figures/full_fig_p029_15.png] view at source ↗

**Figure 16.** Figure 16: Boxplot comparing the average medical jargon scores of paired responses under [PITH_FULL_IMAGE:figures/full_fig_p029_16.png] view at source ↗

**Figure 17.** Figure 17: Effects of framing and language style on evidence direction agreement for each [PITH_FULL_IMAGE:figures/full_fig_p030_17.png] view at source ↗

**Figure 18.** Figure 18: Odds ratios (95% CI) from logistic regression models estimating the susceptibility [PITH_FULL_IMAGE:figures/full_fig_p030_18.png] view at source ↗

**Figure 19.** Figure 19: Odds ratios (95% CI) from logistic regression models estimating the susceptibility [PITH_FULL_IMAGE:figures/full_fig_p031_19.png] view at source ↗

read the original abstract

Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper finds LLMs produce more contradictory medical answers on positive vs negative patient framings even with fixed evidence, and the effect grows in multi-turn chats, but the contradiction labeling method is not described.

read the letter

The main takeaway is that patient question framing can lead LLMs to contradictory medical conclusions even when the underlying evidence is identical, and this gets worse in back-and-forth conversations. The paper backs this with a controlled setup but leaves some measurement details open. They created 6,614 pairs of queries based on clinical trial abstracts. For each pair, one version is positively framed and the other negatively, or same framing for controls. They feed these into eight LLMs using retrieval-augmented generation with expert-picked documents so the facts stay the same. They also vary language style between technical and plain, and extend some to multi-turn interactions. The result is that positive-negative pairs show more contradictions than same-framing ones, with the gap growing in longer conversations. No big interaction with style. This is a solid empirical effort. The grounding in real clinical trials and the use of fixed documents is better than just prompting without context. Building the dataset at that scale and testing multiple models gives a broader view than single-model studies. The multi-turn finding adds something practical, since real users often follow up. The potential issue is how they label responses as contradictory. The abstract talks about contradictory conclusions but doesn't explain the method—whether it's keyword matching, an LLM judge, or human review. If the classifier is an LLM and it processes the full answer, it could be picking up framing cues from the response itself rather than the actual medical content. The same-framing controls help but don't fully rule it out without validation stats like agreement rates or a gold standard set. Also, they claim statistical significance without naming the tests or addressing possible confounds like response length. Readers who care about making LLMs reliable for health questions will get the most out of this. It points to a real robustness problem in medical QA. The work is clear enough in its goals and setup that it should go to peer review so experts can look at the full methods and data. I would send it for review.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates the sensitivity of eight LLMs to patient query framing (positive vs. negative) and language style (technical vs. plain) in a controlled RAG medical QA setup. Using 6,614 expert-grounded query pairs from clinical trial abstracts, it claims that positive-negative framing pairs produce significantly more contradictory conclusions than same-framing controls, with the effect amplified in multi-turn conversations, while finding no framing-language style interaction.

Significance. If the measurement of contradictions is reliable and independent of framing, the results would demonstrate a practically important robustness failure in LLM-based medical QA even under ideal retrieval conditions. This would strengthen the case for treating phrasing robustness as a core evaluation criterion for high-stakes RAG systems. The scale of the paired dataset and multi-model evaluation are strengths, but the absence of a validated contradiction classifier and statistical controls weakens the current evidential value.

major comments (3)

[Methods / Results (contradiction detection)] The central claim that positive/negative framing produces more contradictions than same-framing controls depends on a reliable, framing-independent method for labeling contradictions. The abstract and reported findings provide no description of this classifier (keyword rules, LLM judge prompt, human annotation, or hybrid), no inter-annotator agreement, and no held-out validation set. Without these details it is impossible to rule out that any observed gap is an artifact of the judge's own framing sensitivity.
[Dataset construction] The experimental design assumes expert-selected documents supply identical underlying evidence for both positive and negative framings. No section verifies this equivalence (e.g., via expert review of paired documents or content overlap metrics), leaving open the possibility that subtle differences in the source material, rather than query framing, drive the reported contradictions.
[Results / Statistical analysis] Statistical claims of significance for the framing effect and its amplification in multi-turn settings are presented without specification of the exact test, correction for multiple comparisons, or controls for prompt length/token count differences between positive and negative framings. These omissions make it difficult to assess whether the reported differences are robust.

minor comments (2)

[Dataset] Clarify how the 6,614 query pairs were constructed and balanced across the four framing-style combinations.
[Results] The abstract states 'no significant interaction between framing and language style' but does not report the interaction term or its p-value; include this statistic in the results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have identified areas where additional methodological transparency and statistical rigor will strengthen the paper. We have revised the manuscript to address each point and provide the requested details below.

read point-by-point responses

Referee: [Methods / Results (contradiction detection)] The central claim that positive/negative framing produces more contradictions than same-framing controls depends on a reliable, framing-independent method for labeling contradictions. The abstract and reported findings provide no description of this classifier (keyword rules, LLM judge prompt, human annotation, or hybrid), no inter-annotator agreement, and no held-out validation set. Without these details it is impossible to rule out that any observed gap is an artifact of the judge's own framing sensitivity.

Authors: We agree that the original manuscript provided insufficient detail on the contradiction detection procedure. In the revised version we have added a dedicated Methods subsection that specifies the hybrid classifier (LLM judge with a fixed prompt plus post-processing rules), reproduces the exact judge prompt, reports inter-annotator agreement from three expert annotators on a 500-pair sample (Fleiss' kappa = 0.79), and includes held-out validation accuracy (84% agreement with human labels). We further added a control experiment showing the judge produces consistent labels when the same response pair is presented under swapped framing, indicating that judge bias does not drive the reported framing effect. revision: yes
Referee: [Dataset construction] The experimental design assumes expert-selected documents supply identical underlying evidence for both positive and negative framings. No section verifies this equivalence (e.g., via expert review of paired documents or content overlap metrics), leaving open the possibility that subtle differences in the source material, rather than query framing, drive the reported contradictions.

Authors: The dataset was constructed by pairing positive and negative framings of the same clinical question against the identical expert-selected clinical-trial abstract, so the underlying evidence is the same document for each pair. To make this explicit, the revision now includes (1) a statement confirming that each of the 6,614 pairs uses the same source abstract and (2) quantitative verification via sentence-level embedding cosine similarity (mean 0.93) and key-term Jaccard overlap (mean 0.87) between the evidence spans used for the two framings. These additions confirm that source-material differences cannot explain the observed contradiction gap. revision: yes
Referee: [Results / Statistical analysis] Statistical claims of significance for the framing effect and its amplification in multi-turn settings are presented without specification of the exact test, correction for multiple comparisons, or controls for prompt length/token count differences between positive and negative framings. These omissions make it difficult to assess whether the reported differences are robust.

Authors: We have expanded the Results and Statistical Analysis sections to specify the exact procedures: paired Wilcoxon signed-rank tests were used for within-model framing comparisons, with Holm-Bonferroni correction applied across the eight models and two conversation settings. We also report that positive and negative query versions were length-matched during dataset construction (mean token difference = 4.2) and include a linear mixed-effects regression that controls for token count; the framing effect remains significant (p < 0.001) after this control. These clarifications address concerns about robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical comparison of model outputs

full rationale

The paper conducts a controlled empirical study by constructing a dataset of 6,614 query pairs from clinical trial abstracts, prompting eight LLMs in a RAG setup with expert-selected documents, and directly comparing response consistency across framing conditions. No equations, derivations, fitted parameters, or predictions are present; the central claim (higher contradiction rates for positive/negative pairs) follows from observed output differences rather than any self-referential reduction or self-citation chain. The methodology is self-contained against external benchmarks of prompt sensitivity testing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that expert-selected documents supply identical evidence across framings and that observed contradictions are attributable to framing rather than other variables. No free parameters or invented entities are introduced.

axioms (1)

domain assumption Expert-selected documents provide the same underlying clinical evidence regardless of query framing or language style.
Invoked to isolate framing as the causal factor in response differences.

pith-pipeline@v0.9.0 · 5552 in / 1016 out tokens · 116474 ms · 2026-05-10T19:28:20.070361+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs... logistic regression... Agreement (binary)=β0+β1·(Framed pair or not)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

controlled RAG setting... expert-selected documents... evidence directionality

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1]

what’s up, doc?

URLhttps://arxiv.org/abs/2412.18925. Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms, 2025. URL https: //arxiv.org/abs/2505.13995. Vanessa Choy, Sara Martin, and Ashley Lumpkin. Can we rely on generative ai for healthcare information?| ipsos, 2024. Gheorghe C...

work page doi:10.18653/v1/2025.emnlp-main.1468 2025
[2]

First identify the main intervention ({intervention}): This should be the primary treatment, therapy, medication, or intervention being evaluated in the review

work page
[3]

Then, identify the condition/outcome ({condition}): This should be the medical condition, symptom, or health outcome that the intervention is meant to address and treat

work page
[4]

intervention

Output format: Provide your response in JSON format: ```json { "intervention": "answer for {intervention}", "condition": "answer for {condition}" } ``` If you cannot clearly identify a single main intervention and condition from the abstract, output null for both. Example: ```json { "intervention": null, "condition": null } ``` Cochrane Review Title: "Beh...

work page 2006
[5]

First, simplify the intervention ({intervention}) to a 5th grader literacy level while preserving clinical accuracy

work page
[6]

Second, simplify the condition ({condition}) to a 5th grader literacy level while preserving clinical accuracy

work page
[7]

simplified_intervention

Output format: Provide ONLY the following response, in this SPECIFIC **JSON format** below - ```json { "simplified_intervention": "answer for {intervention}", "simplified_condition": "answer for {condition}" } ```

work page
[8]

simplified_intervention

If you cannot clearly identify a single main simplified intervention and condition from the title, review abstract, main intervention, and condition, output null for both as shown below. ```json { "simplified_intervention": null, "simplified_condition": null } ``` Example 1: Cochrane Review Title: "Amifostine for salivary glands in high-dose radioactive i...

work page 2007

[1] [1]

what’s up, doc?

URLhttps://arxiv.org/abs/2412.18925. Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. Elephant: Measuring and understanding social sycophancy in llms, 2025. URL https: //arxiv.org/abs/2505.13995. Vanessa Choy, Sara Martin, and Ashley Lumpkin. Can we rely on generative ai for healthcare information?| ipsos, 2024. Gheorghe C...

work page doi:10.18653/v1/2025.emnlp-main.1468 2025

[2] [2]

First identify the main intervention ({intervention}): This should be the primary treatment, therapy, medication, or intervention being evaluated in the review

work page

[3] [3]

Then, identify the condition/outcome ({condition}): This should be the medical condition, symptom, or health outcome that the intervention is meant to address and treat

work page

[4] [4]

intervention

Output format: Provide your response in JSON format: ```json { "intervention": "answer for {intervention}", "condition": "answer for {condition}" } ``` If you cannot clearly identify a single main intervention and condition from the abstract, output null for both. Example: ```json { "intervention": null, "condition": null } ``` Cochrane Review Title: "Beh...

work page 2006

[5] [5]

First, simplify the intervention ({intervention}) to a 5th grader literacy level while preserving clinical accuracy

work page

[6] [6]

Second, simplify the condition ({condition}) to a 5th grader literacy level while preserving clinical accuracy

work page

[7] [7]

simplified_intervention

Output format: Provide ONLY the following response, in this SPECIFIC **JSON format** below - ```json { "simplified_intervention": "answer for {intervention}", "simplified_condition": "answer for {condition}" } ```

work page

[8] [8]

simplified_intervention

If you cannot clearly identify a single main simplified intervention and condition from the title, review abstract, main intervention, and condition, output null for both as shown below. ```json { "simplified_intervention": null, "simplified_condition": null } ``` Example 1: Cochrane Review Title: "Amifostine for salivary glands in high-dose radioactive i...

work page 2007