Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

Byron C. Wallace; Chantal Shaib; Junyi Jessy Li; Kaijie Mo; Ramez Kouzy; Siddhartha Venkatayogi; Wei Xu

arxiv: 2601.11886 · v2 · submitted 2026-01-17 · 💻 cs.CL

Kaijie Mo, Siddhartha Venkatayogi, Chantal Shaib, Ramez Kouzy, Wei Xu, Byron C. Wallace, Junyi Jessy Li

Pith reviewed 2026-05-16 13:25 UTC · model grok-4.3

keywords: large language models, medical QA, faithfulness, safety, counterfactual evidence, MedCounterFact, adversarial evaluation

The pith

Large language models accept counterfactual medical evidence at face value even when dangerous or implausible.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how LLMs respond to medical comparison questions when the supporting evidence has been altered with made-up or harmful elements. It introduces MedCounterFact, a dataset that replaces real treatments in clinical trial descriptions with four kinds of counterfactual stimuli. Multiple frontier models are shown to treat these altered contexts as true and deliver firm recommendations without any qualification or rejection. A reader would care because medical applications require models to avoid endorsing unsafe advice even when the prompt supplies misleading details. The work frames this as models placing too much weight on following the given context over their own safety constraints.

Core claim

When presented with counterfactual medical evidence in the MedCounterFact dataset, existing frontier LLMs overwhelmingly accept the evidence at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. The paper argues this pattern shows models prioritize faithfulness to the supplied context over safety considerations.

What carries the argument

MedCounterFact, a counterfactual medical QA dataset created by systematically replacing real-world medical interventions in randomized controlled trial evidence with four types of stimuli ranging from unknown words to toxic substances.
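
A minimal sketch of what such a substitution could look like, assuming a plain string-level replacement applied identically to the question and the evidence. The stimulus names, the replacement terms, the toy question and evidence, and the helper `make_counterfactual` are all illustrative assumptions; the paper's actual construction pipeline and stimulus lists are not described on this page, and only the two stimulus types named in the abstract are shown.

```python
import re

# Hypothetical stimuli. Only the two endpoints named in the abstract appear here;
# the specific replacement terms are invented for illustration.
STIMULI = {
    "unknown_word": "blorvex",        # made-up nonce word (assumption)
    "toxic_substance": "antifreeze",  # illustrative toxic substance (assumption)
}


def make_counterfactual(question, evidence, intervention, stimulus_type):
    """Replace every whole-word mention of `intervention` in both texts.

    Using one compiled pattern for both strings keeps the question and the
    evidence consistent with each other after substitution.
    """
    replacement = STIMULI[stimulus_type]
    pattern = re.compile(r"\b" + re.escape(intervention) + r"\b", re.IGNORECASE)
    return pattern.sub(replacement, question), pattern.sub(replacement, evidence)


# Toy inputs written for this sketch, not taken from the dataset.
question = "Is aspirin more effective than placebo for reducing headache duration?"
evidence = (
    "In a randomized controlled trial, aspirin reduced headache duration "
    "compared with placebo."
)

cf_question, cf_evidence = make_counterfactual(
    question, evidence, "aspirin", "unknown_word"
)
print(cf_question)
print(cf_evidence)
```

Applying the same pattern to both texts is the design point: a replacement that touched only the question or only the evidence would produce an incoherent item rather than a counterfactual one.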

Load-bearing premise

That the four types of replacements in MedCounterFact adequately simulate real-world misleading medical contexts and that models accepting them shows an overemphasis on faithfulness rather than safety.

What would settle it

A follow-up experiment in which models are given the same counterfactual evidence but are also told the evidence has been fabricated, then measured for whether they retract or qualify their prior confident answers.

Original abstract

In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual (or even adversarial) medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings suggest that models arguably overemphasize the former.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds MedCounterFact to show LLMs accept counterfactual medical evidence without pushback, but the replacements may not create realistic enough contexts to isolate a faithfulness-safety tradeoff.

The paper's main contribution is MedCounterFact, a dataset that replaces real medical interventions in clinical comparison questions and RCT contexts with four types of counterfactuals ranging from unknown terms to toxic substances. The evaluation finds that frontier LLMs mostly accept this evidence at face value and give confident answers even when it is dangerous or implausible. This is a concrete way to surface the tension between following provided context and applying safety priors in medicine, and the systematic replacement approach is new relative to prior medical QA work. The setup is straightforward and the pattern across models is consistent enough to be worth noting for anyone thinking about deployment in high-stakes domains.

The soft spot is whether the counterfactuals actually function as credible but misleading evidence. If the replacements often produce incoherent or obviously broken contexts, then high acceptance rates could reflect poor instruction following or coherence detection rather than an overemphasis on faithfulness. The abstract gives no details on how replacements are kept consistent between question and evidence, whether the resulting text still reads as a plausible trial, or any expert review of the stimuli. Without those checks the central claim is harder to interpret. The results also need the actual counts, confidence intervals, and breakdown by replacement type to show how overwhelming the effect really is.

This is relevant for people building safety evaluations or faithfulness benchmarks for medical LLMs. A reader who works on alignment tradeoffs or dataset construction for high-stakes tasks would get value from the method and the raw observations. It is worth sending to peer review because the question matters and the dataset is reusable, though the authors will need to strengthen the validation of the counterfactual construction and report fuller statistics.

Referee Report

3 major / 1 minor

Summary. The paper constructs MedCounterFact, a counterfactual medical QA dataset, by systematically replacing real-world medical interventions in clinical comparison questions and the associated RCT evidence with four types of stimuli (ranging from unknown words to toxic substances). It evaluates frontier LLMs and reports that models overwhelmingly accept the counterfactual evidence at face value, even when it is dangerous or implausible, and give confident, uncaveated answers. The paper reads this as models overemphasizing faithfulness to context over safety considerations.

Significance. If the empirical result holds under more rigorous validation, the work identifies a concrete failure mode in high-stakes LLM deployment where instruction-following and context adherence can override safety priors. The new dataset and replacement taxonomy offer a reusable testbed for studying faithfulness-safety trade-offs, which could directly inform alignment techniques such as refusal training or context-aware safety filters.

major comments (3)

[Abstract / Dataset construction] Abstract and dataset-construction section: the four replacement types are asserted to produce evidence where safety-aware models 'ought to refuse or caveat,' yet no details are given on whether replacements preserve internal coherence between the question and the RCT evidence, nor on any expert validation that the resulting contexts mimic plausible fabricated medical information rather than obvious nonsense. This assumption is load-bearing for interpreting acceptance rates as a faithfulness-safety tradeoff rather than an instruction-following artifact.
[Abstract] Abstract: the claim that models 'overwhelmingly accept such evidence at face value' is presented without sample sizes, number of items per replacement type, statistical tests, or the full evaluation protocol (e.g., prompting templates, scoring rubric for 'confident and uncaveated'). The provided support is therefore insufficient to ground the central empirical conclusion.
[Evaluation] Evaluation section: without reporting inter-rater agreement on answer categorization or baseline performance on the original (non-counterfactual) questions, it is impossible to isolate the effect of the counterfactual replacements from general model behavior on medical QA.
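
To make the request for sample sizes and uncertainty concrete, here is a minimal sketch of a Wilson score interval for an acceptance rate. The function name `wilson_interval` and the counts in the example are assumptions chosen for illustration; they are not results from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    `successes` is the number of accepted items, `n` the number of items
    evaluated, and `z` the normal quantile (1.96 gives roughly 95% coverage).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 90 of 100 counterfactual items accepted at face value.
low, high = wilson_interval(90, 100)
print(f"acceptance rate 0.90, 95% CI [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here instead of the simpler normal approximation because it stays within [0, 1] and behaves sensibly when the rate is close to 0 or 1, which is the regime an "overwhelming" acceptance rate would fall into.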

minor comments (1)

[Abstract] The abstract refers to 'randomized controlled trials provided as context' but does not clarify whether the RCT text is also modified by the same replacement rules or left intact; this should be made explicit.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have made corresponding revisions.

Referee: [Abstract / Dataset construction] Abstract and dataset-construction section: the four replacement types are asserted to produce evidence where safety-aware models 'ought to refuse or caveat,' yet no details are given on whether replacements preserve internal coherence between the question and the RCT evidence, nor on any expert validation that the resulting contexts mimic plausible fabricated medical information rather than obvious nonsense. This assumption is load-bearing for interpreting acceptance rates as a faithfulness-safety tradeoff rather than an instruction-following artifact.

Authors: We thank the referee for this observation. We agree that additional details on the replacement process are needed to support the interpretation of results as a faithfulness-safety tradeoff. In the revised manuscript, we have expanded the dataset construction section to describe the systematic replacement procedure, including how substitutions were performed to preserve internal coherence (e.g., maintaining syntactic structure, logical flow of RCT descriptions, and medical terminology patterns). We also clarify that replacements were manually inspected by the authors to avoid obvious nonsense and produce contexts resembling plausible fabricated evidence; we have added a brief discussion justifying why this design supports interpreting acceptance rates as evidence of over-emphasis on faithfulness rather than a pure instruction-following artifact.
revision: yes

Referee: [Abstract] Abstract: the claim that models 'overwhelmingly accept such evidence at face value' is presented without sample sizes, number of items per replacement type, statistical tests, or the full evaluation protocol (e.g., prompting templates, scoring rubric for 'confident and uncaveated'). The provided support is therefore insufficient to ground the central empirical conclusion.

Authors: We agree that the abstract was too concise and lacked key quantitative and methodological details. In the revised version, we have updated the abstract to report the dataset size and composition (number of items per replacement type), a high-level description of the evaluation protocol, and reference to the scoring rubric for 'confident and uncaveated' answers. We have also added statistical support (e.g., acceptance rates with confidence intervals and relevant tests) to the results section and moved full prompting templates and the rubric to an appendix to better ground the central empirical claim.
revision: yes

Referee: [Evaluation] Evaluation section: without reporting inter-rater agreement on answer categorization or baseline performance on the original (non-counterfactual) questions, it is impossible to isolate the effect of the counterfactual replacements from general model behavior on medical QA.

Authors: We acknowledge that these elements were missing from the original submission and would strengthen the evaluation. We have revised the Evaluation section to include baseline performance results on the original (non-counterfactual) questions, allowing direct comparison and isolation of the counterfactual effect. For the categorization of model responses, we now report inter-rater agreement (e.g., Cohen's or Fleiss' kappa) among the annotators who applied the predefined rubric, confirming the reliability of the measurements.
revision: yes
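
For readers unfamiliar with the agreement statistic mentioned in the last response, a minimal sketch of Cohen's kappa for two annotators follows. The label set ("accept", "caveat", "refuse") and the annotations are hypothetical, invented for this example; the paper's rubric categories are not given on this page.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six model responses.
annotator_1 = ["accept", "accept", "caveat", "refuse", "accept", "caveat"]
annotator_2 = ["accept", "accept", "caveat", "accept", "accept", "refuse"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Raw percent agreement would overstate reliability when one label dominates, as it would if most responses are coded "accept"; kappa subtracts the agreement expected by chance from the label frequencies alone.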

Circularity Check

0 steps flagged

No circularity: empirical evaluation on constructed dataset

The paper constructs MedCounterFact via systematic term replacements and reports direct LLM evaluation results showing acceptance of counterfactual evidence. No equations, derivations, fitted parameters, or self-referential definitions appear in the chain; the central claim rests on observed model outputs rather than on any reduction to its inputs by construction. Self-citations, if present, are not load-bearing for the empirical findings, which remain independently replicable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the constructed counterfactual stimuli validly test the faithfulness-safety boundary and that model outputs can be interpreted as evidence of over-prioritizing context adherence.

axioms (2)

domain assumption: Models should be evaluated primarily on adherence to provided context in QA tasks.
This underpins the definition of faithfulness used to interpret model behavior.
ad hoc to paper: Systematic replacement of real interventions with counterfactual stimuli preserves question structure while altering plausibility.
Invoked in the dataset construction step described in the abstract.

pith-pipeline@v0.9.0 · 5512 in / 1177 out tokens · 44801 ms · 2026-05-16T13:25:27.510535+00:00 · methodology
