Safety and accuracy follow different scaling laws in clinical large language models

Andreas Maier; Daniel Truhn; Gerhard Wellein; Harald Köstler; Jeta Sopa; Mahshad Lotfinia; Michael Uder; Sebastian Bickelhaup; Sebastian Wind; Soroosh Tayebi Arasteh

arxiv: 2605.04039 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI · cs.LG

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup, Michael Uder, Harald Köstler, Gerhard Wellein, Sven Nebelung, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Pith reviewed 2026-05-06 04:51 UTC · model claude-opus-4-7

classification 💻 cs.CL · cs.AI · cs.LG

keywords clinical large language models · radiology question answering · AI safety evaluation · retrieval-augmented generation · model calibration · dangerous overconfidence · ensemble failure · deployment benchmarking

The pith

Clinical language model safety is set by evidence quality, not by model size or inference-time compute, and existing retrieval and ensembling pipelines do not reproduce the safety gains of clinician-written evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in clinical language models, accuracy and safety follow different scaling laws: a model can become more accurate while remaining dangerously overconfident in the small set of cases where it fails. To make this measurable, the authors build a 200-question radiology benchmark in which every answer option carries clinician-assigned labels for high-risk error, unsafe answer, and contradiction with provided evidence, and then run 34 models through six deployment conditions plus self-consistency and three-model ensembles. They show that only clinician-written clean evidence moves all four safety axes simultaneously and that retrieval-augmented systems, longer contexts, repeated sampling, and ensembles each leave clinically consequential failure modes intact — including a new failure mode where ensemble members fail unanimously on the same question. Confidence is shown to be useless as a safety filter: models report nearly the same confidence on correct answers and on dangerous wrong answers. The takeaway for a clinical reader is that deployment safety has to be measured directly under the evidence and context conditions in which the system will run, not inferred from average accuracy.

Core claim

Across 34 language models and six deployment conditions on a 200-question radiology safety benchmark, the authors find that what makes a clinical model safer is not bigger size, longer context, retrieval pipelines, or more inference-time sampling, but the quality of the evidence handed to the model. Clinician-written clean evidence cut high-risk errors from 12.0% to 2.6%, contradictions from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%, while standard retrieval, agentic retrieval, max-context prompting, self-consistency, and three-model ensembles failed to reproduce that safety profile. Deployment condition explained 38–45% of the variance in safety metrics, model family onl

What carries the argument

RadSaFE-200, a 200-question radiology benchmark with option-level safety labels (high-risk, unsafe, contradiction) plus paired clinician-written clean and conflict evidence, evaluated under a fixed factorial of 34 models × 6 deployment conditions × 200 questions (40,800 evaluations), with a "dangerous overconfidence" metric that combines incorrectness, clinical risk of the chosen option, and entropy-normalised repeated-sampling confidence ≥0.80. A two-way variance decomposition over the model-by-condition grid carries the central claim that condition dominates family.
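
The dangerous-overconfidence metric described above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' code: it assumes confidence is 1 minus the Shannon entropy of the sampled votes, normalised by the log of the number of answer options; that "clinical risk" means the chosen option carries the high-risk label; and that ties in the modal vote are broken alphabetically. All function names are hypothetical.

```python
import math
from collections import Counter


def ballot_confidence(votes, n_options):
    """1 - normalised Shannon entropy of the empirical vote distribution.

    Assumption: entropy is normalised by log(n_options).
    """
    if not votes or n_options < 2:
        raise ValueError("need at least one vote and two answer options")
    k = len(votes)
    counts = Counter(votes)
    entropy = -sum((c / k) * math.log(c / k) for c in counts.values())
    return 1.0 - entropy / math.log(n_options)


def modal_answer(votes):
    """Most frequent vote; ties go to the alphabetically first option (assumed rule)."""
    counts = Counter(votes)
    return min(counts, key=lambda opt: (-counts[opt], opt))


def is_dangerously_overconfident(votes, correct, high_risk_options,
                                 n_options, threshold=0.80):
    """True when the modal answer is wrong, labelled high-risk, and confident."""
    chosen = modal_answer(votes)
    return (
        chosen != correct
        and chosen in high_risk_options
        and ballot_confidence(votes, n_options) >= threshold
    )


# Toy example: 19 of 20 samples pick a high-risk wrong option "C".
votes = ["C"] * 19 + ["A"]
flag = is_dangerously_overconfident(
    votes, correct="A", high_risk_options={"C"}, n_options=4
)
```

In this toy case the flag is raised because all three conditions hold at once; a confident correct answer, or a confident wrong answer on a low-risk option, would not count.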

If this is right

Clinical AI evaluation should report high-risk error, contradiction, unsafe-answer rate, dangerous overconfidence, and synchronized ensemble failure alongside accuracy, not accuracy alone.
Buying a larger model or longer context window cannot substitute for investing in curated, clinically reliable evidence pipelines at deployment time.
Retrieval systems should be judged by the residual high-risk and high-confidence errors they leave behind, not only by mean accuracy improvement.
Ensembles of strong models can mask risk because their wrong answers are often unanimous, producing a false sense of reassurance for downstream users.
Confidence scores from current LLMs cannot be used as a safety gate, since confidence on dangerous wrong answers tracks confidence on correct answers almost one-to-one.
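
To illustrate why ensemble accuracy alone can hide synchronized failure, the sketch below scores a three-member ensemble on toy data. It assumes that a synchronized failure means every member picks the same incorrect option and that vote ties go to the alphabetically first option; both are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Plurality vote; ties go to the alphabetically first option (assumed rule)."""
    counts = Counter(answers)
    return min(counts, key=lambda opt: (-counts[opt], opt))


def ensemble_report(member_answers, answer_key):
    """member_answers maps question id -> list of member answers."""
    n = len(member_answers)
    wrong = 0
    synchronized = 0
    for qid, answers in member_answers.items():
        if majority_vote(answers) != answer_key[qid]:
            wrong += 1
        # Synchronized failure: all members agree on one incorrect option.
        if len(set(answers)) == 1 and answers[0] != answer_key[qid]:
            synchronized += 1
    return {
        "ensemble_error_rate": wrong / n,
        "synchronized_failure_rate": synchronized / n,
    }


# Toy data: four questions, three ensemble members each.
toy_answers = {
    "q1": ["A", "A", "A"],  # unanimous and correct
    "q2": ["B", "B", "B"],  # unanimous and wrong
    "q3": ["A", "C", "C"],  # split, majority wrong
    "q4": ["D", "D", "A"],  # split, majority correct
}
toy_key = {"q1": "A", "q2": "C", "q3": "A", "q4": "D"}
report = ensemble_report(toy_answers, toy_key)
```

Reporting the two rates separately distinguishes errors a downstream user could notice from member disagreement (q3) from errors that arrive with unanimous agreement (q2).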

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The 'evidence quality dominates scaling' result is likely an upper bound on what curation can buy: in real deployments the clean-evidence condition is what a perfect retriever would deliver, so the gap between standard RAG and clean evidence is effectively a budget for how much retrieval quality has yet to improve.
Synchronized ensemble failure suggests that current open-weight model families share correlated training-data blind spots on specific radiology cases; identifying and patching those ~15–30 recurrent failure questions may yield more safety than any architectural change.
Because dangerous overconfidence is defined from repeated-sampling stability rather than calibrated token probabilities, models that are deterministic-by-construction (low sampling variance) will look maximally confident regardless of correctness — a measurement artifact worth separating from true miscalibration.
The finding that conflict evidence degrades safety before it degrades accuracy implies that benchmarks reporting only accuracy under noisy retrieval will systematically understate clinical risk.

Load-bearing premise

That the option-level safety labels — assigned by a single board-certified radiologist with no inter-rater agreement reported — are a faithful proxy for real clinical risk, since every headline metric is computed from those single-annotator labels.

What would settle it

Have multiple independent board-certified radiologists re-label RadSaFE-200's 865 answer options with the same rule set and recompute the cross-condition comparisons; if inter-rater agreement is poor on the high-risk and contradiction labels, or if the gap between clean evidence and standard/agentic RAG shrinks below clinical relevance under a different but equally defensible labeling, the central claim that evidence quality dominates scaling no longer holds.
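
As a rough illustration of the agreement statistic such a re-labeling study would report, the sketch below computes Cohen's kappa for two raters. The labels are made-up toy values, not data from the paper, and the function name is hypothetical.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Toy high-risk labels (1 = high-risk) for ten answer options.
rater_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
rater_2 = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(rater_1, rater_2)
```

Here the raters agree on 8 of 10 items, but because chance agreement is 0.52 the kappa is only about 0.58, which is why raw percentage agreement alone would overstate label reliability.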

Figures

Figures reproduced from arXiv: 2605.04039 by Andreas Maier, Daniel Truhn, Gerhard Wellein, Harald Köstler, Jeta Sopa, Mahshad Lotfinia, Michael Uder, Sebastian Bickelhaup, Sebastian Wind, Soroosh Tayebi Arasteh, Sven Nebelung, Tri-Thien Nguyen.

**Figure 1.** Overview of the SaFE-Scale evaluation framework.

**Figure 2.** Safety and accuracy decouple across deployment conditions on RadSaFE-200. In all pan…

**Figure 3.** Confidence is not a safety signal under any deployment condition.

**Figure 4.** Safety and accuracy follow different scaling laws on RadSaFE-200.

**Figure 5.** Inference-time compute via self-consistency (SC) does not produce safety. Eight models…

**Figure 6.** Ensembling does not produce safety, and introduces synchronized failure as a new failure…

Original abstract

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Solid factorial benchmark with one framing problem the stress-tester nailed: "clean evidence" is closer to an oracle hint than to high-quality retrieved evidence, which inflates the headline contrast.

Two things to know up front. First, this is a careful piece of empirical work — 34 models, 6 conditions, 200 questions, bootstrap CIs paired across conditions, deterministic tie-breaking, public code, and a separately released benchmark. The execution is real. Second, the headline framing — "evidence quality dominates over scale, RAG, and inference-time compute" — is partly an artifact of how "clean evidence" was constructed.

What the paper actually does well. The factorial design is clean. The confidence analysis is the strongest substantive result: confidence on high-risk wrong answers is essentially indistinguishable from confidence on correct answers, so confidence is not a usable safety filter. That holds independent of any labeling subjectivity. The synchronized-failure result for ensembles is clinically meaningful and not standard fare in this literature. The self-consistency null result is honest — most groups would have buried it. The variance decomposition and within-family collapse plots are nicely executed.

Where the stress-test concern lands, and it does land. "Clean evidence" was written by the same annotator (T.T.N.) as a concise rationale for why the correct option is correct, and contradiction is scored against that same text. So (a) the contradiction drop in the clean condition is partly definitional, (b) the 73.5→94.1% accuracy jump is consistent with "models can copy a near-oracle hint" rather than with any general claim about evidence quality, and (c) the variance decomposition is dominated by including the two near-oracle conditions on the condition axis. Strip those out and the four realistic conditions (closed-book, std RAG, agentic RAG, max context) span a much narrower band. The paper's actual finding is closer to "real retrieval is far below an oracle ceiling, and inference-time tricks don't close that gap" — still useful, but a different claim than the abstract makes.

Other softer concerns. The safety labels come from a single annotator with no inter-rater reliability, which the paper acknowledges in its limitations. 169/200 questions and both RAG pipelines come from the same group's prior work, which is disclosed but worth flagging. The number of sampled generations k varies across models for compute reasons, weakly coupling the confidence metric to compute budget.

Who this is for. People building or evaluating clinical RAG systems and anyone working on LLM safety-vs-accuracy decoupling. The benchmark and code release make it useful infrastructure even if you disagree with the framing.

Recommendation: send to review. The work is sound enough to deserve referee time, and the framing issue is exactly the kind of thing a good referee should force into the abstract and Fig. 4 caption. Don't desk-reject; push back.

Referee Report

5 major / 8 minor

Summary. The manuscript introduces SaFE-Scale, an evaluation framework, and RadSaFE-200, a 200-question radiology MCQ benchmark with option-level safety annotations (high-risk error, unsafe answer, contradiction). Across 34 locally deployed LLMs and six deployment conditions, the authors report that clinician-written clean evidence produces large gains in both accuracy (73.5% → 94.1%) and safety metrics, while standard RAG, agentic RAG, max-context prompting, self-consistency, and three-model ensembling do not reproduce that profile. A two-way variance decomposition attributes 38–45% of variance in the safety metrics to deployment condition vs. 8–17% to model family. Worst-case analysis identifies a small subset of questions that defeat most models. The central claim is that clinical LLM safety is governed primarily by evidence quality and deployment design, not by scale or inference-time compute.

Significance. If the central claim holds, the paper makes a useful methodological contribution: it operationalizes option-level clinical safety (severity-aware error labeling, contradiction, dangerous overconfidence), and it shows that ensembling and self-consistency can introduce synchronized failure rather than safety. The factorial design (34 × 6 × 200 = 40,800 model-condition-question evaluations), bootstrap CIs over questions with shared resampling indices for paired comparisons, deterministic tie-breaking, threshold-sensitivity analysis for dangerous overconfidence (Suppl. Note 3, Suppl. Fig. 2), public benchmark release, and full code/prompt/configuration release are real strengths and should be credited. The synchronized-failure finding for ensembles (Fig. 6d,e) and the demonstration that confidence remains high even on high-risk errors (Fig. 3a,d) are independently useful and largely robust to the issues raised below.
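
The phrase "bootstrap CIs over questions with shared resampling indices" can be unpacked with a short sketch: each bootstrap replicate draws one set of question indices and applies it to both conditions, so the interval reflects the paired difference. This is a minimal percentile-bootstrap illustration on toy data; the replicate count, seed, and function name are assumptions, not details from the paper.

```python
import random


def paired_bootstrap_delta(outcomes_a, outcomes_b, n_boot=2000, seed=0):
    """Percentile 95% CI for rate(b) - rate(a) over aligned per-question 0/1 outcomes."""
    if len(outcomes_a) != len(outcomes_b) or not outcomes_a:
        raise ValueError("conditions must cover the same non-empty question set")
    n = len(outcomes_a)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        # One index draw per replicate, shared by both conditions (pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        rate_a = sum(outcomes_a[i] for i in idx) / n
        rate_b = sum(outcomes_b[i] for i in idx) / n
        deltas.append(rate_b - rate_a)
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    point_estimate = (sum(outcomes_b) - sum(outcomes_a)) / n
    return point_estimate, (lower, upper)


# Toy data: 1 = high-risk error on that question, 200 questions per condition.
closed_book = [1] * 20 + [0] * 180
clean = [1] * 4 + [0] * 196
point, (lo, hi) = paired_bootstrap_delta(closed_book, clean)
```

Sharing the indices matters because errors on the same question are correlated across conditions; resampling each condition independently would typically widen the interval on the difference.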

major comments (5)

[Methods (Clinical safety augmentation); Results (Table 1)] The 'clean evidence' condition is constructed by the same annotator (T.T.N.) as 'a concise explanation of why the correct answer was correct,' restricted to the key clinical and imaging facts and avoiding option letters. This is closer to an answer-key paraphrase than to retrieved clinical evidence. The headline contrast 'clean evidence vs. standard RAG / agentic RAG' therefore conflates two distinct interventions: (i) supplying a near-oracle hint and (ii) supplying higher-quality retrieved evidence. The 73.5% → 94.1% accuracy jump and the within-family variance collapse in Fig. 4f are consistent with models copying a near-oracle hint; they do not, on their own, support the framing in the Abstract and Discussion that 'evidence quality and deployment design dominate over scale and RAG.' I recommend either (a) reframing clean evidence as an upper bound / oracle reference rather than a depl
[Methods (Clinical safety augmentation); Eq. (5)] Contradiction is defined as 'the option directly contradicted the clean evidence' and is 'judged only against clean evidence, not conflict evidence.' In the clean-evidence condition the model is shown the exact reference text against which contradiction is scored. The reduction in contradiction from 12.7 → 2.3 is therefore partly mechanical: shown the reference text, a model that copies it cannot contradict it. This should be acknowledged explicitly, and the cross-condition contradiction comparison should either be removed from the headline claims or be supplemented by a contradiction metric that is held fixed (e.g., scored against an external clinician statement that is not placed in the model's prompt under any condition).
[Methods (Clinical safety augmentation); Limitations] All option-level safety labels (high-risk, unsafe, contradiction) — which feed every safety endpoint and the dangerous-overconfidence definition in Eq. (7) — were assigned by a single board-certified radiologist (T.T.N.). The paper itself states that labeling required 'implicit assumptions about how an answer choice would translate into clinical action,' especially for negation, physics, and radiation-therapy questions. With no inter-rater reliability data, the magnitudes of the cross-condition deltas (e.g., 12.0 → 2.6 high-risk error) are sensitive to systematic annotator bias in a way that cannot be quantified from the current submission. A second independent annotator on at least a stratified sample of the 865 options, with Cohen's kappa or equivalent, is needed to support the strength of the headline claims.
[Methods (Deployment conditions); Data availability] 169 of 200 questions originate from the authors' prior RadioRAG and RaR datasets, and the agentic RAG comparator is the authors' RaR framework while the standard RAG comparator is adapted from the authors' RadioRAG implementation. Comparing 'clean evidence' against these particular RAG implementations on a benchmark partially derived from those same systems' question pools is a self-comparison; the conclusion that 'standard RAG and agentic RAG do not reproduce the safety profile of clean evidence' is therefore conditional on these specific pipelines and corpora. The Discussion already notes this, but the Abstract and section headings ('Safety and accuracy follow different scaling laws') generalize beyond what the experiment supports. Either soften the framing or add at least one third-party RAG baseline (e.g., a non-author medical RAG system or a different retrieval corpus) under identic
[Results (Confidence is not a safety signal); Eq. (2)] Confidence is defined as 1 − normalized entropy of the empirical ballot distribution over k stochastic generations (k = 20, 5, or 3 depending on model). Three models (DeepSeek-R1, DeepSeek-V3.2, Llama-4-Scout) use much smaller k, which mechanically constrains the achievable confidence resolution and the dangerous-overconfidence rate. Because dangerous overconfidence requires P_mqc ≥ 0.80, models with k = 3 can hit thresholds with single-vote shifts that are not comparable to k = 20 models. Please report dangerous overconfidence separately for the k = 20 subpanel, and verify that the headline cross-condition shifts are not driven by the small-k models.
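
The resolution concern in the last comment can be checked by enumerating every vote split that k samples can produce. The sketch below assumes four answer options and normalisation by the log of the option count; both are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def confidence_from_counts(counts, n_options):
    """1 - normalised entropy for a tuple of per-option vote counts."""
    k = sum(counts)
    entropy = -sum((c / k) * math.log(c / k) for c in counts if c > 0)
    return 1.0 - entropy / math.log(n_options)


def partitions(k, max_parts, largest=None):
    """Yield non-increasing tuples of positive ints summing to k, at most max_parts long."""
    if largest is None:
        largest = k
    if k == 0:
        yield ()
        return
    if max_parts == 0:
        return
    for first in range(min(k, largest), 0, -1):
        for rest in partitions(k - first, max_parts - 1, first):
            yield (first,) + rest


def achievable_confidences(k, n_options=4):
    """Sorted distinct confidence levels reachable with k votes."""
    return sorted(
        {round(confidence_from_counts(p, n_options), 4)
         for p in partitions(k, n_options)}
    )


levels_k3 = achievable_confidences(3)
levels_k20 = achievable_confidences(20)
# Smallest modal vote count that still reaches the 0.80 threshold at k = 20.
min_votes_k20 = min(
    p[0] for p in partitions(20, 4) if confidence_from_counts(p, 4) >= 0.80
)
```

Under these assumptions, k = 3 yields only three confidence levels and only a unanimous ballot reaches 0.80, whereas at k = 20 the threshold corresponds to at least 19 of 20 agreeing samples, so the two settings are not measuring the same thing.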

minor comments (8)

[Abstract] The numbers reported for dangerous overconfidence in the Introduction (8.0 → 17.9 → 3.7) do not match Table 1 (8.0 → 1.6) or the Abstract (8.0 → 1.6). The 17.9 figure appears to be a stale value; please reconcile.
[Table 1] Safety-metric rows in Table 1 give point estimates only. Since bootstrap CIs are computed for accuracy, please also report 95% CIs for high-risk error, unsafe, contradiction, and dangerous overconfidence, at least for the headline closed-book vs. clean-evidence comparison.
[Fig. 4d (variance decomposition)] The OLS variance decomposition is unweighted across a heterogeneous model panel (parameter counts spanning 0.5B–685B). Consider reporting a sensitivity analysis with model-size-bucket weights, and clarify that the 'family' factor confounds developer choices, training corpora, and instruction tuning.
[Methods (Inference protocol)] The verifier is Mistral-Small-4-119B-2603, which is also an evaluated target model under a different name elsewhere. Please confirm that the verifier's role was strictly letter-mapping (no semantic judgment) and add a brief audit (e.g., manual check of n=200 verifier mappings) showing the verifier did not introduce systematic bias for any model family.
[Results (Worst-case failures)] Table 4 reports 15 questions with very high cross-model failure rates (Q53: 100% wrong, 97.1% high-risk). Some of these may simply be mislabeled or genuinely ambiguous — the fact that 34/34 models fail and 34/34 contradict the supplied evidence should at least trigger an editorial check on those specific items.
[Methods (Deployment conditions)] Standard RAG is referenced internally as 'top_10' (Suppl. Note 1). Please state explicitly in the main text which retrieval depth is the 'standard RAG' condition, since top-1, top-5, and top-10 are all listed in the codebase and Suppl. Table 8.
[Front matter] The arXiv identifier on the title page (2605.04039) and the reported model release dates (e.g., 'April 2026', 'May 2025') are inconsistent with current dating conventions; please verify.
[Methods (Statistical analysis)] The variance-decomposition R² components (43+9+6, 45+8+4, 38+17+8) leave large residuals (≈42%, ≈43%, ≈37%). This residual is consistent with substantial within-family, within-condition variation that is not captured by the two-factor model and deserves a sentence of acknowledgement in the main text, not only the figure caption.
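
For readers unfamiliar with how such a residual arises, the sketch below runs a main-effects two-way sum-of-squares decomposition on a toy family-by-condition grid. It assumes a balanced grid with one value per cell, so any interaction is absorbed into the residual; the paper's OLS specification may differ, and the numbers are invented for illustration.

```python
def two_way_variance_shares(grid):
    """Share of total sum of squares for rows (family), columns (condition), residual.

    grid[i][j] is the metric for family i under condition j (balanced, one value per cell).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in grid]
    col_means = [
        sum(grid[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)
    ]
    ss_total = sum((v - grand) ** 2 for row in grid for v in row)
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n_rows * sum((m - grand) ** 2 for m in col_means)
    ss_resid = ss_total - ss_rows - ss_cols
    return {
        "family": ss_rows / ss_total,
        "condition": ss_cols / ss_total,
        "residual": ss_resid / ss_total,
    }


# Toy high-risk error rates (%): 3 families x 3 conditions.
toy_grid = [
    [12.0, 3.0, 10.0],
    [14.0, 2.0, 12.0],
    [10.0, 3.0, 9.0],
]
shares = two_way_variance_shares(toy_grid)
```

The three shares always sum to one, so whatever the named factors do not explain is reported as residual rather than silently dropped.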

Simulated Author's Rebuttal

5 responses · 0 unresolved

We thank the referee for a careful and constructive review. The five major comments converge on a coherent and, in our view, correct critique: as currently framed, the 'clean evidence' condition functions as an oracle clinician rationale rather than as a deployable evidence-supply intervention, and several of the headline contrasts (contradiction reduction, RAG comparators, dangerous-overconfidence aggregation) inherit interpretive weaknesses from that framing and from related design choices (single annotator, author-internal RAG pipelines, heterogeneous k for confidence). We accept these points and propose a substantive revision that (i) reframes clean evidence as an oracle upper bound throughout, (ii) adds a held-out contradiction metric not placed in any prompt, (iii) introduces a second-rater inter-rater reliability study with Cohen's kappa on a stratified sample of the 865 options, (iv) adds a third-party RAG baseline (MedRAG/MedCorp) and a corpus-disjoint baseline plus a New-31-only sensitivity analysis, and (v) reports dangerous overconfidence separately for the k=20 subpanel and segregates the three small-k models. The referee identifies the synchronized-failure result, the confidence non-discrimination result, and the broader methodological contribution as largely robust to these issues, and we agree these claims do not require structural change. We believe the revised manuscript will support a narrower but better-supported version of the central claim and will addre

Referee: Clean evidence is constructed by the same annotator as 'a concise explanation of why the correct answer was correct', which is closer to an answer-key paraphrase than to retrieved clinical evidence. The headline contrast with RAG conflates supplying a near-oracle hint with supplying higher-quality retrieved evidence; clean evidence should be reframed as an upper bound / oracle reference.

Authors: We accept this point. Clean evidence was deliberately constructed as a concise, clinician-written rationale that excludes option letters, but the referee is correct that it sits closer to an oracle rationale than to ecologically valid retrieved evidence. The 73.5% → 94.1% jump and the within-family variance collapse (Fig. 4f) are consistent with both 'evidence quality' and 'near-oracle copying' interpretations, and the manuscript does not currently disentangle them. In revision we will: (i) rename the condition 'oracle clinician rationale (upper bound)' throughout the Methods, Results, and Figures; (ii) reframe the Abstract and Discussion to position this condition as an upper-bound reference rather than a deployable comparator to RAG; (iii) restate the central claim as 'safety-relevant gains from standard and agentic RAG remain far below the oracle-rationale ceiling, and inference-time compute does not close the gap'; and (iv) retain conflict evidence as a paired stress test of the same upper-bound construct. The synchronized-failure, confidence-non-discrimination, and inference-time-compute findings, which the referee notes are largely robust, are not affected by this reframing. revision: yes
Referee: Contradiction is judged against clean evidence shown in the prompt; the 12.7 → 2.3 reduction is therefore partly mechanical. Either remove from headline claims or supplement with a contradiction metric scored against an external clinician statement not placed in the prompt.

Authors: Agreed. As written, contradiction under the clean-evidence condition is partially tautological: a model that reproduces the supplied rationale cannot, by construction, contradict it. We will (i) explicitly state this circularity in the Methods (Eq. 5) and in the Results paragraph reporting the 12.7 → 2.3 figure; (ii) remove the cross-condition contradiction delta from the Abstract headline; and (iii) add a held-out contradiction analysis in which contradiction is scored against an independent clinician statement that is never inserted into any model prompt. Because the option-level contradiction labels in RadSaFE-200 were already authored separately from the per-question clean-evidence text, we can compute this held-out variant from existing annotations without re-running inference, and we will report it as the primary cross-condition contradiction metric. Closed-book vs. RAG contradiction comparisons, which do not share this circularity, will be retained. revision: yes
Referee: All option-level safety labels were assigned by a single radiologist, with no inter-rater reliability. A second independent annotator on at least a stratified sample with Cohen's kappa is needed.

Authors: We accept this limitation, which is already partially flagged in the Limitations section but not quantified. For the revision, a second board-certified radiologist (independent of T.T.N.) will re-annotate a stratified random sample of options covering all three labels (high-risk, unsafe, contradiction), all subspecialties, and over-sampling the technical, physics, radiation-therapy, and negation-style items that the manuscript identifies as most subjective. We target ~250 options (≈30% of 865), stratified by label prevalence, and will report Cohen's kappa per label, percentage agreement, and a disagreement-adjudication summary. We will additionally repeat the headline cross-condition deltas using only options on which both annotators agreed, as a sensitivity analysis. We expect this work to add several weeks to the revision timeline but consider it necessary to support the magnitude claims. revision: yes
Referee: 169 of 200 questions originate from the authors' prior RadioRAG and RaR datasets, and the RAG comparators are the authors' own pipelines; the conclusion that 'standard and agentic RAG do not reproduce the safety profile of clean evidence' is therefore a self-comparison. Either soften the framing or add a third-party RAG baseline.

Authors: We accept the self-comparison concern and will address it on both fronts. First, we will soften the section heading 'Safety and accuracy follow different scaling laws' and the corresponding Abstract sentence to make explicit that the RAG conclusions are conditional on the RadioRAG and RaR pipelines and on Radiopaedia as the corpus. Second, we will add at least one third-party RAG baseline under the same prompt, verifier, and scoring protocol. Our planned additions are (a) MedRAG (Xiong et al.) over the MedCorp corpus as a non-author medical RAG system, and (b) a BM25 + dense-hybrid retriever over PubMed abstracts as a corpus-disjoint baseline. We will report these on the same five safety endpoints and update Table 1 and Fig. 2 accordingly. We will also include a leave-source-out sensitivity analysis restricted to the New-31 subset (which does not originate from RadioRAG or RaR) to test whether the qualitative ordering of conditions is preserved on questions external to the authors' prior systems. revision: yes
Referee: Confidence is derived from k stochastic generations with k=20, 5, or 3 across models. Small-k models can hit the P_mqc ≥ 0.80 threshold via single-vote shifts, which is not comparable to k=20 models. Report dangerous overconfidence separately for the k=20 subpanel and verify headline shifts are not driven by small-k models.

Authors: This is a valid concern that we had not addressed quantitatively. The k=3 and k=5 models are DeepSeek-R1, DeepSeek-V3.2, and Llama-4-Scout, and the referee is correct that with k=3 the modal vote share can only take the values 1/3, 2/3, or 1, so the confidence score has only three achievable levels, making the 0.80 threshold mechanically easy to cross. In revision we will: (i) add a k=20-only subpanel to Table 1 and recompute model-averaged dangerous overconfidence and the cross-condition deltas on the 31-model subset; (ii) report the three small-k models separately rather than pooling them in the headline averages; (iii) add a Supplementary Figure analogous to Fig. 3e–g restricted to k=20; and (iv) re-run the threshold-sensitivity analysis (Suppl. Fig. 2) on the k=20 subpanel. From a preliminary check, excluding the three small-k models leaves the qualitative ordering of conditions unchanged (clean ≪ conflict ≪ RAG variants ≈ closed-book ≈ max-context), but we will report exact revised values in the next version. We will also explicitly caveat the dangerous-overconfidence definition as resolution-limited for small k and recommend k ≥ 20 for future deployments of the metric. revision: yes

Circularity Check

3 steps flagged

Low-to-moderate circularity: the contradiction-rate drop under clean evidence is partly definitional because the same annotator wrote both the clean-evidence text and the option-level contradiction labels scored against it; the rest of the chain is empirically grounded.

specific steps

self-definitional [Methods, Clinical safety augmentation; Outcome definitions Eq. 5]
"For each question, clean evidence was written as a concise explanation of why the correct answer was correct... The contradiction label was set to 1 if the option directly contradicted the clean evidence. Contradiction was judged only against clean evidence, not conflict evidence."

The contradiction labels D_{q,j} are defined against the clean-evidence text, and the same annotator (T.T.N.) authored both the clean-evidence text and the labels. In the clean-evidence deployment condition the model is given that exact reference text. The headline drop in contradiction rate from 12.7% to 2.3% therefore partly reduces to: when shown reference text X, models select fewer options pre-labeled as contradicting X. This is a definitional coupling rather than an independently discovered safety property, and it is presented as one of four headline safety axes.
fitted input called prediction [Results, 'Safety and accuracy decouple'; Methods, Clean-evidence prompting]
"Clean evidence... a concise explanation of why the correct answer was correct. The evidence was limited to the key clinical and imaging facts, avoided option letters... Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%."

The clean-evidence text is a clinician-authored rationale for the correct option, distributed only at evaluation time. Functionally it is a near-oracle hint, not a 'higher-quality retrieved passage.' Comparing it to standard/agentic RAG (which retrieves Radiopaedia passages by similarity) conflates two distinct factors, how directly the text gives away the answer and how well retrieval works, and lets a quasi-oracle stand in for 'evidence quality.' The paper's framing that 'evidence quality dominates over scale and RAG' therefore follows partly by construction of the clean-evidence channel.

self citation load bearing [Methods, RadSaFE-200 benchmark construction; Deployment conditions]
"The benchmark pools questions from three predefined subsets: the RadioRAG subset [13]... the RaR subset [12]... Agentic RAG was implemented using the previously described radiology Retrieval-and-Reasoning framework [12]."

Of the 200 questions, 169 come from prior work by the same author group, as do the standard-RAG pipeline (RadioRAG, ref. 13) and the agentic-RAG implementation (RaR, ref. 12). The agentic-RAG-vs-clean-evidence comparison is therefore partly an in-house comparison between two of the authors' own systems. This is mitigated by the open code and data and by the external model panel, so it raises the score modestly rather than dominating it.
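
The definitional coupling described in the first step can be made concrete with a minimal sketch. It assumes the contradiction rate is the fraction of questions whose selected option carries the label D_{q,j} = 1; the function name, data layout, and toy values below are hypothetical and are not taken from the paper or its code.

```python
from typing import Dict, List


def contradiction_rate(selected: Dict[str, int], labels: Dict[str, List[int]]) -> float:
    """Fraction of questions whose chosen option is labeled as contradicting the clean evidence.

    selected[q] is the index j of the option the model chose for question q.
    labels[q][j] plays the role of D_{q,j} (1 = option contradicts the clean evidence).
    """
    if not selected:
        raise ValueError("no answers to score")
    hits = sum(labels[q][j] for q, j in selected.items())
    return hits / len(selected)


# Toy labels (hypothetical): 4 questions with 4 options each.
labels = {
    "q1": [0, 1, 1, 0],
    "q2": [1, 0, 1, 1],
    "q3": [0, 0, 1, 1],
    "q4": [1, 1, 0, 0],
}

# Hypothetical selections without and with the reference text in the prompt.
zero_shot = {"q1": 1, "q2": 1, "q3": 2, "q4": 2}
with_evidence = {"q1": 0, "q2": 1, "q3": 0, "q4": 2}

rate_without = contradiction_rate(zero_shot, labels)    # 2 of 4 selections are labeled 1
rate_with = contradiction_rate(with_evidence, labels)   # 0 of 4 selections are labeled 1
```

Because the labels are written against the same text that the clean-evidence condition places in the prompt, a drop from `rate_without` to `rate_with` is what one would expect from any model that reads its prompt, which is why the audit treats this axis as closer to a sanity check.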

full rationale

The paper's headline empirical claims (clean evidence beats RAG/scaling on safety) rest on direct measurement against external models, so the central derivation is not tautological. Variance decomposition, latency, accuracy, and dangerous-overconfidence comparisons are computed against 34 externally trained LLMs and would falsify the hypothesis if those models happened to use clean evidence poorly. That is real evidence and is not circular.

There is, however, one identifiable circular step. The contradiction label is, by construction, defined against the clean-evidence text ("Contradiction was judged only against clean evidence, not conflict evidence"), and the clean-evidence text plus the option-level contradiction labels were both authored by the same annotator (T.T.N.). When the model is shown the clean-evidence text in the clean-evidence condition, the reduction of "contradiction" from 12.7% to 2.3% partly reduces to: models shown reference text X pick options pre-labeled as contradicting X less often. This is closer to a sanity check than a discovered safety property, and the paper presents it as a headline safety axis. The same annotator-authored-evidence-vs-annotator-authored-labels coupling weakly inflates the high-risk and accuracy gaps under clean evidence as well, because clean evidence functions as a near-oracle rationale rather than a "high-quality retrieval" baseline.

Self-citation load: 169/200 questions and the agentic-RAG (RaR) and standard-RAG (RadioRAG) pipelines are from the same author group's prior work. This is disclosed and the implementations are open-source, so it does not by itself create circularity, but it does mean the agentic-RAG-vs-clean-evidence comparison is partly an internal benchmark comparison.

The uniqueness/forced-choice patterns (kinds 3–5) are not present: no theorem from the authors' prior work is invoked to forbid alternatives, no ansatz is smuggled in via citation, and the safety-metric construction is stated explicitly with equations rather than imported. Net: one mechanically definitional step (contradiction label vs. clean evidence) and a benchmark-overlap concern that the reader and authors both flag. The accuracy, high-risk, dangerous-overconfidence, latency, ensemble-synchronization, and scaling-decoupling findings are not reducible to inputs and constitute genuine empirical content. Score 3.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Standard empirical ML benchmarking paper. No new physical or theoretical entities are postulated. The central claim rests on (a) the validity of single-annotator option-level safety labels as a proxy for clinical harm, (b) the choice of confidence threshold (0.80) and the entropy-normalized repeated-sampling stability as the operational definition of "confidence," and (c) the representativeness of 200 multiple-choice radiology questions for clinical risk. No fitted model parameters in the ML sense; the framework introduces design choices rather than free numerical parameters.
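
To illustrate why the 0.80 threshold in (b) is a design choice that shapes the result, here is a small sketch. It assumes, for illustration only, that an answer counts as dangerously overconfident when it is wrong, its selected option carries the high-risk label, and its confidence is at or above the threshold; the paper's exact definition is not reproduced here, and the `Answer` type and toy values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Answer:
    correct: bool        # did the model pick the keyed answer?
    high_risk: bool      # does the selected option carry the high-risk label?
    confidence: float    # confidence score in [0, 1]


def dangerous_overconfidence_rate(answers: List[Answer], threshold: float = 0.80) -> float:
    """Share of all answers that are wrong, high-risk, and at or above the threshold.

    This operational definition is an assumption made for the sketch.
    """
    if not answers:
        raise ValueError("no answers to score")
    flagged = [
        a for a in answers
        if not a.correct and a.high_risk and a.confidence >= threshold
    ]
    return len(flagged) / len(answers)


# Toy answers (hypothetical values).
answers = [
    Answer(correct=True, high_risk=False, confidence=0.95),
    Answer(correct=False, high_risk=True, confidence=0.85),
    Answer(correct=False, high_risk=True, confidence=0.60),
    Answer(correct=False, high_risk=False, confidence=0.90),
    Answer(correct=True, high_risk=False, confidence=0.70),
]

rate_at_080 = dangerous_overconfidence_rate(answers, threshold=0.80)
rate_at_090 = dangerous_overconfidence_rate(answers, threshold=0.90)
rate_at_050 = dangerous_overconfidence_rate(answers, threshold=0.50)
```

On this toy data the flagged share changes with each threshold, which is the sense in which the threshold is a design choice and not a fitted parameter.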

pith-pipeline@v0.9.0 · 75693 in / 7326 out tokens · 115050 ms · 2026-05-06T04:51:51.027608+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.Jcost / Cost.FunctionalEquation washburn_uniqueness_aczel unclear

Confidence was computed as one minus the entropy of this distribution, normalised by the size of the full ballot space ... P_mqc = 1 − (−Σ p log p)/log(|A|+1).
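
The quoted formula can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `ballot_confidence` is hypothetical, `votes` stands for the repeated-sampling ballots for one question, and the extra ballot implied by |A|+1 is modeled here as an "abstain" category, which the quoted passage does not specify.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def ballot_confidence(votes: Sequence[Hashable], n_options: int) -> float:
    """Return 1 - H(p) / log(n_options + 1) for the empirical vote distribution p.

    n_options corresponds to |A|; the "+ 1" is the extra ballot in the full
    ballot space (assumed here to be something like an abstain/invalid vote).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    total = len(votes)
    counts = Counter(votes)
    # Shannon entropy of the observed ballots; categories with zero votes add nothing.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(n_options + 1)


# Unanimous ballots give zero entropy, so confidence is at its maximum.
unanimous = ballot_confidence(["B"] * 5, n_options=4)

# Ballots spread evenly over the full ballot space give confidence near zero.
spread = ballot_confidence(["A", "B", "C", "D", "abstain"], n_options=4)

# A 4-to-1 split lands in between.
split = ballot_confidence(["B", "B", "B", "B", "C"], n_options=4)
```

Note that this score measures agreement across samples, not correctness: a model that repeats the same wrong option every time would still score the maximum.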

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.