Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models

Carsten Eickhoff; Daniel Shubin; Michal Golovanevsky; Oliver McLaughlin; Ritambhara Singh; William Rudman

arxiv: 2604.09841 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-10 17:57 UTC · model grok-4.3

keywords: vision-language models, medical fine-tuning, prompt sensitivity, clinical reasoning, medical imaging, vision encoders

The pith

Medically fine-tuned vision-language models show no reliable advantage and degrade on harder clinical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether fine-tuning vision-language models on medical data imparts genuine clinical reasoning. It tests general and medically fine-tuned model pairs on four imaging tasks that increase in difficulty from basic classification to histopathology. Results indicate that accuracy falls to near-random levels as tasks grow harder, medical fine-tuning provides no consistent gain, and models react strongly to small prompt changes. An alternative pipeline of generating descriptions then diagnosing with a text model yields only limited improvement, suggesting the core issues lie in visual feature quality and reasoning depth.

Core claim

Across paired general and medically fine-tuned vision-language models, domain-specific fine-tuning yields no consistent performance advantage on medical imaging tasks, while both model types display extreme sensitivity to prompt formulation and performance that approaches random guessing as clinical reasoning demands increase.

What carries the argument

Comparative testing of LLaVA versus LLaVA-Med and Gemma versus MedGemma on tasks of graded difficulty, plus a two-stage description-then-diagnosis pipeline using a text-only model to probe for suppressed knowledge.
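The two-stage pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Describer` and `Diagnoser` callables, the function names, and the label normalization are all hypothetical stand-ins for the VLM and the text-only model, whose actual prompts and APIs are not shown on this page.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
Describer = Callable[[bytes], str]  # VLM: image bytes -> free-text description
Diagnoser = Callable[[str], str]    # text-only model: description -> label


def describe_then_diagnose(image: bytes, describe: Describer, diagnose: Diagnoser) -> str:
    """Stage 1: the VLM describes the image. Stage 2: a text-only model labels it."""
    description = describe(image)
    # Normalize so that "Positive " and "positive" compare equal.
    return diagnose(description).strip().lower()


def pipeline_accuracy(
    samples: Iterable[Tuple[bytes, str]],
    describe: Describer,
    diagnose: Diagnoser,
) -> float:
    """Fraction of (image, label) pairs the two-stage pipeline labels correctly."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    correct = sum(
        describe_then_diagnose(image, describe, diagnose) == label.strip().lower()
        for image, label in samples
    )
    return correct / len(samples)


# Stub models for illustration only; real use would wrap actual model calls.
def stub_describe(image: bytes) -> str:
    return "dense mass visible" if image == b"a" else "no abnormality"


def stub_diagnose(description: str) -> str:
    return "Positive" if "mass" in description else "Negative"
```

Keeping the two stages as separate callables makes it possible to swap the describing VLM (base versus medically fine-tuned) while holding the diagnosing text model fixed, which is the comparison the pipeline is meant to support.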

If this is right

Domain fine-tuning cannot be assumed to produce robust medical reasoning in VLMs.
Prompt design must be treated as a major variable in deployment, since small changes alter outcomes substantially.
Weak visual embeddings contribute to failures independently of the language component.
Standard fine-tuning on image-label pairs leaves substantial clinical knowledge unextracted.
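One way to treat prompt design as a measured variable is to score each prompt variant separately and report the spread. The sketch below assumes that a refusal is recorded as `None` and counted as incorrect; both conventions, and all function names, are illustrative choices rather than the paper's protocol.

```python
from typing import Dict, List, Optional


def variant_metrics(preds: List[Optional[str]], labels: List[str]) -> Dict[str, float]:
    """Accuracy and refusal rate for one prompt variant (None marks a refusal)."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equally long")
    refused = sum(p is None for p in preds)
    # A refusal (None) never equals a string label, so it counts as incorrect.
    correct = sum(p == y for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels), "refusal_rate": refused / len(labels)}


def prompt_sensitivity(
    preds_by_prompt: Dict[str, List[Optional[str]]],
    labels: List[str],
) -> Dict[str, float]:
    """Max-minus-min accuracy and refusal rate across prompt variants."""
    per_prompt = [variant_metrics(p, labels) for p in preds_by_prompt.values()]
    if not per_prompt:
        raise ValueError("need at least one prompt variant")
    accs = [m["accuracy"] for m in per_prompt]
    refs = [m["refusal_rate"] for m in per_prompt]
    return {
        "accuracy_spread": max(accs) - min(accs),
        "refusal_spread": max(refs) - min(refs),
    }
```

A large `accuracy_spread` on a fixed test set means that the reported number depends as much on the wording as on the model.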

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models may benefit from training that explicitly targets reasoning steps rather than end-to-end classification.
The same prompt fragility could limit VLMs in other specialized domains requiring precise interpretation.
Improving the vision encoder separately might address part of the performance gap before further fine-tuning.

Load-bearing premise

The four medical tasks represent a true increasing scale of clinical reasoning difficulty and the paired models are similar enough in base capabilities that differences can be attributed mainly to medical fine-tuning.

What would settle it

Finding a medically fine-tuned model that maintains stable high accuracy across all tasks regardless of prompt variations or that shows no performance decline as task difficulty increases.

Figures

Figures reproduced from arXiv: 2604.09841 by Carsten Eickhoff, Daniel Shubin, Michal Golovanevsky, Oliver McLaughlin, Ritambhara Singh, William Rudman.

**Figure 2.** Small prompt changes cause inconsistent decisions in both medical and general VLMs. Our results show that medical fine-tuning only provides gains on tasks requiring limited clinical knowledge, yet produces random performance on challenging diagnostic tasks, indicating that current medical VLMs do not possess true clinical reasoning. We first evaluate whether domain-specific fine-tuning translates into …

**Figure 3.** Performance (F1all) across disease classification tasks for base and medically fine-tuned models, averaged across all prompt variants. Medical fine-tuning yields improvements on some tasks but does not consistently outperform base models. The guided variant includes a short list of clinically relevant signs (s1, . . . ,sn) associated with a positive diagnosis. Full prompt and symptom lists are provided in…

**Figure 4.** t-SNE projections of vision encoder embeddings of the four datasets. Each …

Original abstract

Vision-language models (VLMs) are increasingly adapted through domain-specific fine-tuning, yet it remains unclear whether this improves reasoning beyond superficial visual cues, particularly in high-stakes domains like medicine. We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks of increasing difficulty: brain tumor, pneumonia, skin cancer, and histopathology classification. We find that performance degrades toward near-random levels as task difficulty increases, indicating limited clinical reasoning. Medical fine-tuning provides no consistent advantage, and models are highly sensitive to prompt formulation, with minor changes causing large swings in accuracy and refusal rates. To test whether closed-form VQA suppresses latent knowledge, we introduce a description-based pipeline where models generate image descriptions that a text-only model (GPT-5.1) uses for diagnosis. This recovers a limited additional signal but remains bounded by task difficulty. Analysis of vision encoder embeddings further shows that failures stem from both weak visual representations and downstream reasoning. Overall, medical VLM performance is fragile, prompt-dependent, and not reliably improved by domain-specific fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Medical fine-tuning adds little reliable edge on harder tasks and these VLMs swing wildly with prompt wording, though the model pairs may not isolate the tuning effect cleanly.

The main thing to know is that this paper finds medical fine-tuning gives no consistent advantage over base models on the four tasks, with accuracy dropping toward random on the harder ones and small prompt tweaks causing big drops in performance or more refusals. The description pipeline they add, where the VLM writes an image description and a text model then diagnoses, recovers only limited extra signal before hitting the same difficulty wall. Embedding checks point to problems in both the vision side and the reasoning side.

That multi-task pattern plus the recovery test is the concrete new piece here, and it lines up with what many of us have seen anecdotally when trying to push VLMs into medicine. The paired evaluations across LLaVA/LLaVA-Med and Gemma/MedGemma are a reasonable way to surface the issue without overclaiming. The work is useful for anyone who fine-tunes VLMs for imaging because it shows the current recipe does not reliably produce clinical-level reasoning.

The soft spots sit mainly in the controls. The stress-test point holds some weight: if the medically tuned versions differ from the bases in more than just the medical data step, then the lack of advantage could trace to other training differences rather than fine-tuning itself. The task ordering also assumes a clean rise in clinical reasoning demand, but visual feature difficulty does not always track that way, so the degradation might have mixed causes. The abstract leaves out error bars and exact prompt lists, so the full paper needs those to make the fragility numbers convincing.

Overall this is a solid empirical check rather than a breakthrough, but it is honest about the limits it finds. It is worth bringing to a reading group for the discussion on whether we should keep pouring effort into domain fine-tuning or look elsewhere. I would not cite it in my own work soon unless I were writing a survey on adaptation failures, but a serious editor should send it to referees with requests to document the exact base checkpoints and add statistical detail.

Referee Report

3 major / 2 minor

Summary. The paper claims that domain-specific medical fine-tuning of vision-language models does not yield consistent improvements in clinical reasoning on medical imaging tasks. Through evaluations of paired models (LLaVA/LLaVA-Med and Gemma/MedGemma) on four tasks of increasing difficulty (brain tumor, pneumonia, skin cancer, histopathology classification), it finds performance degrading to near-random levels with increasing task difficulty, high sensitivity to prompt changes affecting accuracy and refusal rates, and only limited recovery via a description-based pipeline where image descriptions are generated and then diagnosed by a text-only model. Failures are linked to weak visual representations and reasoning limitations.

Significance. Should the central findings prove robust, this study would demonstrate that current medical fine-tuning strategies for VLMs are insufficient to reliably extract or apply domain-specific knowledge for diagnostic tasks, underscoring the fragility of these models in high-stakes applications. The strengths include the use of paired model comparisons to isolate fine-tuning effects and the introduction of a description-based pipeline to test for suppressed latent knowledge, along with embedding analysis to pinpoint failure modes. These elements provide actionable insights for improving VLM adaptation in medicine.

major comments (3)

[Abstract] The claim that 'medical fine-tuning provides no consistent advantage' depends on the paired models being comparable except for the fine-tuning step. The manuscript does not provide details confirming that LLaVA-Med is derived from the exact same base as LLaVA (same vision encoder, LLM backbone, parameter count, pretraining) and similarly for MedGemma, which is a load-bearing assumption for attributing results to fine-tuning rather than other architectural differences.
[Abstract] The four tasks are positioned as 'increasing difficulty' to demonstrate limited clinical reasoning, but the manuscript offers no evidence or metrics (such as visual complexity measures or expert-rated reasoning demands) to validate that histopathology classification requires more clinical reasoning than brain tumor classification. This weakens the interpretation of the performance degradation trend.
[Abstract] Performance claims including 'degrades toward near-random levels', 'large swings in accuracy', and 'no consistent advantage' are presented without reference to statistical tests, error bars, sample sizes, or variance measures. This leaves the evidence for prompt sensitivity and the main conclusions only moderately supported, as highlighted by the low soundness rating.
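The kind of check the third comment asks for can be illustrated with an exact binomial test of observed accuracy against chance. This is a sketch under stated assumptions: a binary task with chance level 0.5 by default, a two-sided test built by summing all outcomes no more likely than the observed one, and hypothetical function names. It is not an analysis taken from the paper.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_p_value(correct: int, n: int, chance: float = 0.5) -> float:
    """Two-sided exact p-value for `correct` out of `n` against a chance rate.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one. Intended for test sets of a few hundred images; very large
    n would overflow the float conversion of comb(n, k).
    """
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    observed = binomial_pmf(correct, n, chance)
    total = 0.0
    for k in range(n + 1):
        pmf = binomial_pmf(k, n, chance)
        # Small relative tolerance guards against floating-point ties.
        if pmf <= observed * (1 + 1e-9):
            total += pmf
    return min(1.0, total)


# Illustrative call with made-up counts: 112 correct out of 200 binary decisions.
p_value = exact_binomial_p_value(112, 200)
```

A large p-value here would be consistent with "near-random" performance, whereas a small one would indicate that the accuracy, however low, still differs from guessing.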

minor comments (2)

[Abstract] The text-only model is referred to as 'GPT-5.1'; please clarify if this is a specific version or a placeholder for an existing model like GPT-4o to avoid confusion.
Consider providing the full set of prompts and exact evaluation protocols in an appendix to enhance reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments help clarify key assumptions and strengthen the presentation of our findings on the fragility of medically fine-tuned VLMs. We address each major comment point by point below, indicating revisions made to the manuscript.

Referee: [Abstract] The claim that 'medical fine-tuning provides no consistent advantage' depends on the paired models being comparable except for the fine-tuning step. The manuscript does not provide details confirming that LLaVA-Med is derived from the exact same base as LLaVA (same vision encoder, LLM backbone, parameter count, pretraining) and similarly for MedGemma, which is a load-bearing assumption for attributing results to fine-tuning rather than other architectural differences.

Authors: We appreciate this clarification on model comparability. LLaVA-Med is fine-tuned directly from the LLaVA-1.5-7B checkpoint, sharing the identical CLIP ViT-L/14 vision encoder, Vicuna-7B LLM backbone, and pretraining corpus as documented in the original LLaVA-Med work. MedGemma follows the same pattern from the Gemma-2B base model. To address the concern, we have added a dedicated paragraph in the Methods section (with a supporting table) explicitly listing the shared components, parameter counts, and citations to the source papers for each pair. This makes the isolation of fine-tuning effects explicit.
revision: yes

Referee: [Abstract] The four tasks are positioned as 'increasing difficulty' to demonstrate limited clinical reasoning, but the manuscript offers no evidence or metrics (such as visual complexity measures or expert-rated reasoning demands) to validate that histopathology classification requires more clinical reasoning than brain tumor classification. This weakens the interpretation of the performance degradation trend.

Authors: The task ordering reflects standard clinical progression from lower-specialization modalities (e.g., brain MRI tumor detection) to higher-expertise domains (e.g., histopathology), consistent with medical education and diagnostic literature. We acknowledge the absence of quantitative proxies such as expert-rated reasoning demands or visual complexity scores in the original submission. We have revised the Introduction and Experimental Setup to include a concise rationale citing domain references, while noting that the observed monotonic degradation trend is robust to reordering. New expert ratings or complexity metrics fall outside the current scope and would require a separate study.
revision: partial

Referee: [Abstract] Performance claims including 'degrades toward near-random levels', 'large swings in accuracy', and 'no consistent advantage' are presented without reference to statistical tests, error bars, sample sizes, or variance measures. This leaves the evidence for prompt sensitivity and the main conclusions only moderately supported, as highlighted by the low soundness rating.

Authors: We agree that explicit statistical support improves interpretability. The evaluations used fixed test sets whose sizes are now stated in the Results (typically 200–500 images per task). Prompt-sensitivity experiments involved multiple prompt variants per task; we have added error bars reflecting variance across these variants and noted sample sizes in all figures and tables. Basic comparisons (e.g., accuracy differences between paired models) are now accompanied by effect-size observations. Full per-condition hypothesis testing was not performed in the original exploratory design, but the consistent directional trends across models and tasks support the core claims. We have updated the text and supplementary material accordingly.
revision: yes
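Error bars "reflecting variance across prompt variants" could be computed along the following lines. The sketch assumes plain binary F1 with a designated positive class and the sample standard deviation as the error bar. The page does not define the paper's F1all metric or how the authors compute their error bars, so both choices, and all names, are illustrative.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def binary_f1(preds: List[str], labels: List[str], positive: str = "positive") -> float:
    """Plain binary F1 for the positive class (assumed metric, not the paper's F1all)."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must be equally long")
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_across_prompts(
    preds_by_prompt: Dict[str, List[str]],
    labels: List[str],
    positive: str = "positive",
) -> Tuple[float, float]:
    """Mean F1 over prompt variants and its sample standard deviation."""
    scores = [binary_f1(p, labels, positive) for p in preds_by_prompt.values()]
    if not scores:
        raise ValueError("need at least one prompt variant")
    # stdev needs at least two values; a single variant has no spread to report.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread
```

Reporting the mean together with the spread makes it visible when a difference between paired models is smaller than the variation caused by rewording the prompt.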

Circularity Check

0 steps flagged

No significant circularity: direct empirical benchmarking with external labels

The paper is a straightforward empirical study that evaluates four paired VLMs on four medical imaging classification tasks using accuracy, refusal rates, and embedding analysis against ground-truth dataset labels. No derivations, equations, fitted parameters, or predictions are present; results are measured directly rather than derived from model internals or prior self-citations. The description-based pipeline and prompt-sensitivity tests are additional experimental protocols, not self-referential reductions. Model pairing (LLaVA/LLaVA-Med, Gemma/MedGemma) is presented as given for comparison, with performance differences reported as observations rather than forced by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical evaluation study. It introduces no free parameters, mathematical axioms, or invented entities. It relies on standard machine-learning assumptions such as the validity of classification accuracy as a performance metric and the representativeness of the selected imaging tasks.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We evaluate four paired open-source VLMs (LLaVA vs. LLaVA-Med; Gemma vs. MedGemma) across four medical imaging tasks... performance degrades toward near-random levels as task difficulty increases... medical fine-tuning provides no consistent advantage

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Minor wording changes produce large swings in accuracy and refusal rates... prompt formulation can strongly influence measured performance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
