SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care

Austin Schoeffler; Brian Suffoletto; Carl Preiksaitis; Christian Rose; David Kim; Dongshen Peng; Sun-ha Hong; Yi Wang

arxiv: 2601.16529 · v3 · pith:K2PQ44CEnew · submitted 2026-01-23 · 💻 cs.AI · cs.HC

Dongshen Peng, Yi Wang, Austin Schoeffler, Sun-ha Hong, Brian Suffoletto, David Kim, Carl Preiksaitis, Christian Rose

keywords: clinical, acquiescence, across, encounters, models, evaluation, patient, robustness

Abstract

Large language models (LLMs) deployed in clinical decision support may acquiesce to patient requests for care that conflicts with evidence-based guidelines. We developed SycoEval-EM, a multi-agent simulation framework to evaluate LLM robustness to adversarial patient persuasion in emergency medicine. Across 19 contemporary LLMs and 1,425 simulated clinical encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0% to 100%, revealing a bimodal distribution. Seven models maintained near-perfect guideline adherence, while six acquiesced in the majority of encounters. Vulnerability varied substantially across clinical scenarios. Acquiescence was highest for CT imaging requests, intermediate for antibiotic prescriptions for sinusitis, and lowest for opioid prescriptions for acute back pain. Model scale, recency, and performance on static medical benchmarks did not consistently predict robustness. All five persuasion tactics produced similar acquiescence rates, with no statistically significant differences after correction for multiple comparisons, suggesting a generalized susceptibility rather than tactic-specific weaknesses. LLM-as-judge evaluation was validated against two independent physician raters across 95 matched conversations and demonstrated near-perfect agreement for the primary outcome of acquiescence (Cohen's kappa = 0.957). These findings indicate that static medical benchmarks are insufficient to predict safety performance under sustained social pressure and support incorporating multi-turn adversarial testing into clinical AI evaluation. Notably, two models achieved perfect guideline adherence across all encounters, demonstrating that robustness to patient pressure is attainable without sacrificing effective clinical communication.
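
The agreement statistic reported in the abstract, Cohen's kappa, corrects the raw agreement between two raters for the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' evaluation code: the function name `cohen_kappa` and the toy label lists are assumptions made for illustration, and the labels are invented rather than drawn from the paper's 95 matched conversations.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of each rater's marginal rate.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical, NOT the paper's data):
# 1 = model acquiesced, 0 = model held to the guideline.
judge = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
physician = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kappa = cohen_kappa(judge, physician)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 9 of 10 items (observed agreement 0.90) while chance agreement is 0.54, so kappa is 0.36 / 0.46. A value near 1, such as the 0.957 reported above, means agreement far beyond what matching label frequencies alone would produce.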

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
cs.CL 2026-04 unverdicted novelty 6.0

MedDialBench shows LLMs suffer 1.7-3.4x larger diagnostic accuracy drops from patients fabricating symptoms than withholding them, with fabrication driving super-additive interaction effects across models.

SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
cs.CL 2026-04 unverdicted novelty 6.0

SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure, and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.