Safety and accuracy follow different scaling laws in clinical large language models

2026 · cs.CL · arXiv 2605.04039

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
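The abstract reports accuracy alongside option-level safety rates, and the two can diverge. A minimal Python sketch of how such per-condition rates might be computed from option-level labels follows. The class names, field names, and the 0.8 confidence threshold are hypothetical and are not taken from the paper; in particular, the definition of dangerous overconfidence used here (a high-risk error given with confidence at or above the threshold) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One answer option with hypothetical clinician-assigned labels."""
    correct: bool
    high_risk: bool
    unsafe: bool
    contradicts_evidence: bool


@dataclass(frozen=True)
class Response:
    """A model's chosen option and its self-reported confidence in [0, 1]."""
    chosen: Option
    confidence: float


def safety_metrics(responses, overconfidence_threshold=0.8):
    """Return accuracy and safety rates as fractions of all responses.

    The threshold and the overconfidence definition are assumptions
    made for illustration, not the paper's stated method.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")

    def rate(predicate):
        return sum(1 for r in responses if predicate(r)) / n

    def is_high_risk_error(r):
        return r.chosen.high_risk and not r.chosen.correct

    return {
        "accuracy": rate(lambda r: r.chosen.correct),
        "high_risk_error": rate(is_high_risk_error),
        "unsafe_answer": rate(lambda r: r.chosen.unsafe),
        "contradiction": rate(lambda r: r.chosen.contradicts_evidence),
        "dangerous_overconfidence": rate(
            lambda r: is_high_risk_error(r)
            and r.confidence >= overconfidence_threshold
        ),
    }


# Toy data: two correct answers, one confident high-risk error,
# and one low-confidence benign error.
correct = Option(correct=True, high_risk=False, unsafe=False, contradicts_evidence=False)
risky = Option(correct=False, high_risk=True, unsafe=True, contradicts_evidence=True)
benign_wrong = Option(correct=False, high_risk=False, unsafe=False, contradicts_evidence=False)

responses = [
    Response(correct, 0.9),
    Response(correct, 0.6),
    Response(risky, 0.95),
    Response(benign_wrong, 0.5),
]

metrics = safety_metrics(responses)
print(metrics)
```

In the toy data, the two wrong answers lower accuracy equally, but only one of them registers in the safety rates. That separation between accuracy and safety is what the abstract argues for.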

representative citing papers

Cross-modal linkage risk in clinical vision-language models

cs.CV · 2026-06-01 · conditional · novelty 7.0

Clinical VLMs enable image-to-report retrieval far above chance (15-50x at N=100-10k), persisting beyond disease labels, with targeted DP on projection heads cutting Recall@1 by 61.8% and preserving AUROC.

The strength of clinical evidence is recoverable from language model representations but not from their stated grades

cs.CL · 2026-06-27 · unverdicted · novelty 6.0

Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models, but the models' stated grades perform at chance level and the signal is largely lexical.

citing papers explorer


Cross-modal linkage risk in clinical vision-language models
cs.CV · 2026-06-01 · conditional · none · ref 28 · internal anchor

The strength of clinical evidence is recoverable from language model representations but not from their stated grades
cs.CL · 2026-06-27 · unverdicted · none · ref 46 · internal anchor
