Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence
Pith reviewed 2026-05-16 13:25 UTC · model grok-4.3
The pith
Large language models accept counterfactual medical evidence at face value even when dangerous or implausible.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When presented with counterfactual medical evidence in the MedCounterFact dataset, existing frontier LLMs overwhelmingly accept the evidence at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. The paper argues this pattern shows models prioritize faithfulness to the supplied context over safety considerations.
What carries the argument
MedCounterFact, a counterfactual medical QA dataset created by systematically replacing real-world medical interventions in randomized controlled trial evidence with four types of stimuli ranging from unknown words to toxic substances.
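The replacement step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `make_counterfactual`, the stimulus strings, and the toy question and evidence are all hypothetical, and only the two stimulus types named in the abstract (unknown words and toxic substances) are shown, since the other two are not specified here.

```python
import re

# Hypothetical placeholder stimuli. The paper's actual replacement terms are
# not given in this review, and only two of the four stimulus types are named.
STIMULI = {
    "unknown_word": "zorvanol",
    "toxic_substance": "TOXIC_SUBSTANCE_PLACEHOLDER",
}


def make_counterfactual(question, evidence, intervention, stimulus):
    """Replace every whole-word mention of `intervention` in both texts.

    The same substitution is applied to the question and to the evidence so
    that the two stay consistent with each other.
    """
    pattern = re.compile(r"\b" + re.escape(intervention) + r"\b", re.IGNORECASE)
    # A function replacement avoids treating backslashes in `stimulus` as escapes.
    new_question = pattern.sub(lambda _match: stimulus, question)
    new_evidence = pattern.sub(lambda _match: stimulus, evidence)
    return new_question, new_evidence


# Toy texts invented for illustration only; they are not items from the dataset.
question = "Is aspirin more effective than placebo for reducing headache?"
evidence = "In a randomized controlled trial, Aspirin reduced headache versus placebo."

cf_question, cf_evidence = make_counterfactual(
    question, evidence, "aspirin", STIMULI["unknown_word"]
)
print(cf_question)
print(cf_evidence)
```

Applying one pattern to both texts is the simplest way to keep question and evidence coherent, which is the property the referee report below asks the authors to document.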
Load-bearing premise
That the four types of replacements in MedCounterFact adequately simulate real-world misleading medical contexts and that models accepting them shows an overemphasis on faithfulness rather than safety.
What would settle it
A follow-up experiment in which models are given the same counterfactual evidence but are also told the evidence has been fabricated, then measured for whether they retract or qualify their prior confident answers.
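One way such a follow-up could be scored is sketched below, under stated assumptions: each item is labelled once before and once after the model is told the evidence was fabricated, and the label names (`confident`, `qualified`, `retracted`) and the function `retraction_rate` are hypothetical, not taken from the paper.

```python
def retraction_rate(label_pairs):
    """Fraction of initially confident answers that were later retracted or qualified.

    `label_pairs` is a list of (before, after) labels, where `before` is the
    label of the original answer and `after` is the label once the model has
    been told the evidence was fabricated. Label names are assumptions.
    """
    confident_before = [after for before, after in label_pairs if before == "confident"]
    if not confident_before:
        raise ValueError("no initially confident answers to evaluate")
    changed = sum(after in {"qualified", "retracted"} for after in confident_before)
    return changed / len(confident_before)


# Toy labels invented for illustration only.
pairs = [
    ("confident", "retracted"),
    ("confident", "confident"),
    ("qualified", "qualified"),
    ("confident", "qualified"),
]
rate = retraction_rate(pairs)
print(f"{rate:.2f}")
```

Restricting the denominator to initially confident answers keeps the measure focused on the behaviour the core claim is about.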
Original abstract
In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual (or even adversarial) medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings suggest that models arguably overemphasize the former.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs MedCounterFact, a counterfactual medical QA dataset by systematically replacing real-world medical interventions in clinical comparison questions and associated RCT evidence with four types of stimuli (unknown words to toxic substances). It evaluates frontier LLMs and reports that models overwhelmingly accept the counterfactual evidence at face value, even when dangerous or implausible, providing confident uncaveated answers and thereby overemphasizing faithfulness to context over safety considerations.
Significance. If the empirical result holds under more rigorous validation, the work identifies a concrete failure mode in high-stakes LLM deployment where instruction-following and context adherence can override safety priors. The new dataset and replacement taxonomy offer a reusable testbed for studying faithfulness-safety trade-offs, which could directly inform alignment techniques such as refusal training or context-aware safety filters.
major comments (3)
- [Abstract / Dataset construction] Abstract and dataset-construction section: the four replacement types are asserted to produce evidence where safety-aware models 'ought to refuse or caveat,' yet no details are given on whether replacements preserve internal coherence between the question and the RCT evidence, nor on any expert validation that the resulting contexts mimic plausible fabricated medical information rather than obvious nonsense. This assumption is load-bearing for interpreting acceptance rates as a faithfulness-safety tradeoff rather than an instruction-following artifact.
- [Abstract] Abstract: the claim that models 'overwhelmingly accept such evidence at face value' is presented without sample sizes, number of items per replacement type, statistical tests, or the full evaluation protocol (e.g., prompting templates, scoring rubric for 'confident and uncaveated'). The provided support is therefore insufficient to ground the central empirical conclusion.
- [Evaluation] Evaluation section: without reporting inter-rater agreement on answer categorization or baseline performance on the original (non-counterfactual) questions, it is impossible to isolate the effect of the counterfactual replacements from general model behavior on medical QA.
minor comments (1)
- [Abstract] The abstract refers to 'randomized controlled trials provided as context' but does not clarify whether the RCT text is also modified by the same replacement rules or left intact; this should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have made corresponding revisions.
-
Referee: [Abstract / Dataset construction] Abstract and dataset-construction section: the four replacement types are asserted to produce evidence where safety-aware models 'ought to refuse or caveat,' yet no details are given on whether replacements preserve internal coherence between the question and the RCT evidence, nor on any expert validation that the resulting contexts mimic plausible fabricated medical information rather than obvious nonsense. This assumption is load-bearing for interpreting acceptance rates as a faithfulness-safety tradeoff rather than an instruction-following artifact.
Authors: We thank the referee for this observation. We agree that additional details on the replacement process are needed to support the interpretation of results as a faithfulness-safety tradeoff. In the revised manuscript, we have expanded the dataset construction section to describe the systematic replacement procedure, including how substitutions were performed to preserve internal coherence (e.g., maintaining syntactic structure, logical flow of RCT descriptions, and medical terminology patterns). We also clarify that replacements were manually inspected by the authors to avoid obvious nonsense and produce contexts resembling plausible fabricated evidence; we have added a brief discussion justifying why this design supports interpreting acceptance rates as evidence of over-emphasis on faithfulness rather than a pure instruction-following artifact. revision: yes
-
Referee: [Abstract] Abstract: the claim that models 'overwhelmingly accept such evidence at face value' is presented without sample sizes, number of items per replacement type, statistical tests, or the full evaluation protocol (e.g., prompting templates, scoring rubric for 'confident and uncaveated'). The provided support is therefore insufficient to ground the central empirical conclusion.
Authors: We agree that the abstract was too concise and lacked key quantitative and methodological details. In the revised version, we have updated the abstract to report the dataset size and composition (number of items per replacement type), a high-level description of the evaluation protocol, and reference to the scoring rubric for 'confident and uncaveated' answers. We have also added statistical support (e.g., acceptance rates with confidence intervals and relevant tests) to the results section and moved full prompting templates and the rubric to an appendix to better ground the central empirical claim. revision: yes
-
Referee: [Evaluation] Evaluation section: without reporting inter-rater agreement on answer categorization or baseline performance on the original (non-counterfactual) questions, it is impossible to isolate the effect of the counterfactual replacements from general model behavior on medical QA.
Authors: We acknowledge that these elements were missing from the original submission and would strengthen the evaluation. We have revised the Evaluation section to include baseline performance results on the original (non-counterfactual) questions, allowing direct comparison and isolation of the counterfactual effect. For the categorization of model responses, we now report inter-rater agreement (e.g., Cohen's or Fleiss' kappa) among the annotators who applied the predefined rubric, confirming the reliability of the measurements. revision: yes
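The two statistics named in the responses above, acceptance rates with confidence intervals and Cohen's kappa, can be computed as in the following sketch. It assumes a Wilson score interval for the proportion and exactly two annotators for kappa; the rebuttal does not say which interval or which kappa variant the authors used, and the label names and counts below are invented for illustration.

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy numbers, not results from the paper.
low, high = wilson_interval(successes=8, n=10)
print(f"acceptance rate 0.80, interval ({low:.3f}, {high:.3f})")

annotator_1 = ["accept", "accept", "caveat", "refuse"]
annotator_2 = ["accept", "caveat", "caveat", "refuse"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

The Wilson interval is used here rather than the simpler normal approximation because it behaves better when acceptance rates are close to 0 or 1, which is the regime the paper's headline claim describes. With more than two annotators, Fleiss' kappa would be the corresponding measure.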
Circularity Check
No circularity: empirical evaluation on constructed dataset
The paper constructs MedCounterFact via systematic term replacements and reports direct LLM evaluation results showing acceptance of counterfactual evidence. No equations, derivations, fitted parameters, or self-referential definitions appear in the chain; the central claim rests on observed model outputs rather than on any reduction to inputs by construction. Self-citations, if present, are not load-bearing for the empirical findings, which remain independently replicable.
Axiom & Free-Parameter Ledger
axioms (2)
- [domain assumption] Models should be evaluated primarily on adherence to provided context in QA tasks
- [ad hoc to paper] Systematic replacement of real interventions with counterfactual stimuli preserves question structure while altering plausibility