Recognition: no theorem link
Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Pith reviewed 2026-05-15 07:56 UTC · model grok-4.3
The pith
A momentum proxy instruction enables fine-tuning of medical LVLMs using only image-description pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core discovery is that fine-tuning large vision language models on image-description pairs alone, guided by a momentum proxy instruction instead of curated instructions, allows the model to maintain its instruction-following capability from pre-training. This enables the model to respond effectively to medical domain instructions at inference time. The addition of a response shuffling strategy reduces dependency on sequential word predictions, leading to improved performance on visual question answering tasks in medical imaging.
What carries the argument
The momentum proxy instruction, which replaces explicit instructions by using a dynamically updated version derived from the model's own outputs to guide fine-tuning on image-description pairs.
If this is right
- Fine-tuning becomes feasible with readily available image-description pairs rather than expert-crafted instructions.
- The fine-tuned model responds flexibly to domain-specific instructions at test time despite their absence during training.
- Accuracy improves on multiple-choice VQA for SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets.
- Overall efficiency of adapting LVLMs to medical domains increases.
Where Pith is reading between the lines
- This approach may apply to other fields where expert-annotated instructions are costly or scarce.
- Testing on open-ended response tasks could reveal whether the proxy maintains performance beyond multiple-choice formats.
- Integrating this with parameter-efficient methods could further lower the resources needed for domain adaptation.
- The response shuffling might address similar over-reliance issues in other sequence models.
Load-bearing premise
The momentum proxy instruction successfully maintains the pre-trained instruction-following capability while ensuring that updated parameters remain effective when actual instructions are used at inference.
What would settle it
A controlled experiment in which models fine-tuned with the momentum proxy are then tested on entirely new medical instructions never reflected in any proxy form; failure to follow those instructions would disprove the preservation of capability.
original abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an instruction-free tuning method for large vision-language models (LVLMs) in the medical domain. It fine-tunes models using only image-description pairs by introducing a momentum proxy instruction to substitute for curated instructions and a response shuffling strategy to reduce over-reliance on prior tokens. The central claim is that this preserves the pre-trained model's instruction-following capability, enabling flexible responses to domain-specific medical instructions at inference, and yields state-of-the-art accuracy on multiple-choice visual question answering across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets.
Significance. If the empirical results hold under rigorous verification, the approach could meaningfully reduce dependence on expert-annotated instruction data, a major bottleneck for medical LVLM adaptation. The momentum proxy and shuffling mechanisms offer concrete, potentially reusable ideas for efficient domain transfer. Credit is given for targeting a practical limitation and evaluating on four distinct medical VQA benchmarks.
major comments (3)
- [Method (momentum proxy definition)] The momentum proxy instruction is presented as preserving instruction-following for unseen medical prompts, yet the manuscript provides no derivation, equivalence proof, or ablation demonstrating that the proxy-generated signal aligns parameters with medical reasoning patterns absent from pre-training. This assumption is load-bearing for the instruction-free claim.
- [Experiments] The SOTA accuracy claim on SKINCON, WBCAtt, CBIS, and MIMIC-CXR is stated without reported baselines, ablation tables, error bars, or statistical significance tests. The experimental section must include these to substantiate the efficiency and performance gains.
- [§3.3 (response shuffling)] The response shuffling strategy is introduced to mitigate token over-reliance, but no quantitative results isolate its contribution to handling medical terminology or multi-step reasoning in the reported VQA tasks.
minor comments (2)
- [Method] Notation for the momentum update rule should be formalized with explicit equations to avoid ambiguity in how the proxy is computed from image-description pairs; one plausible form is sketched after this list.
- [Experiments] Dataset statistics (number of image-description pairs used per benchmark) and training hyperparameters are missing from the main text and should be added for reproducibility.
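For illustration, one plausible explicit form, assuming the exponential-moving-average update typical of momentum methods such as MoCo (He et al., 2020). The paper's excerpts name the coefficient α and the vectors t and t̄, but the full rule is truncated there, so this is a sketch rather than the authors' verified equation:

% Hedged sketch of a momentum proxy update; superscripts index optimization steps.
\bar{t}^{(k+1)} = \alpha\,\bar{t}^{(k)} + (1-\alpha)\,t^{(k+1)}, \qquad \alpha \in \{0.9,\ 0.99,\ 0.999\}

where t^(k) is the learnable proxy instruction after step k and t̄ is the slower-moving momentum proxy used in the forward pass.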
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our instruction-free tuning method for medical LVLMs. We address each major comment below and will revise the manuscript to strengthen the presentation of the momentum proxy, experimental results, and response shuffling analysis.
point-by-point responses
-
Referee: The momentum proxy instruction is presented as preserving instruction-following for unseen medical prompts, yet the manuscript provides no derivation, equivalence proof, or ablation demonstrating that the proxy-generated signal aligns parameters with medical reasoning patterns absent from pre-training. This assumption is load-bearing for the instruction-free claim.
Authors: We agree that a more rigorous justification is needed. The momentum proxy is formulated as an exponential moving average of prior model-generated instructions to dynamically simulate instruction-following signals during fine-tuning on image-description pairs. In the revision we will add the explicit mathematical definition, a brief motivation section explaining its alignment with pre-trained instruction patterns, and an ablation comparing performance with and without the proxy on held-out medical prompts. revision: yes
-
Referee: The SOTA accuracy claim on SKINCON, WBCAtt, CBIS, and MIMIC-CXR is stated without reported baselines, ablation tables, error bars, or statistical significance tests. The experimental section must include these to substantiate the efficiency and performance gains.
Authors: We acknowledge the need for stronger empirical support. The revised manuscript will include (i) comparisons against standard fine-tuning and recent medical LVLM baselines, (ii) full ablation tables for the momentum proxy and shuffling components, (iii) error bars computed over multiple random seeds, and (iv) paired statistical significance tests (e.g., McNemar or t-tests) for the reported accuracy improvements. revision: yes
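For concreteness, a minimal sketch of the kind of paired test named above (an exact two-sided McNemar test on per-question correctness); the arrays and numbers are illustrative, not results from the paper:

from math import comb

def mcnemar_exact(correct_a, correct_b):
    # Exact McNemar test on paired 0/1 correctness vectors.
    # b = questions only model A answers correctly, c = only model B;
    # under the null hypothesis the discordant count follows Bin(b + c, 0.5).
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # the two models disagree on no question
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative use: 1 = correct answer on a VQA item, 0 = incorrect.
ours     = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [1, 0, 0, 1, 0, 0, 1, 0]
print(f"McNemar p = {mcnemar_exact(ours, baseline):.3f}")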
-
Referee: The response shuffling strategy is introduced to mitigate token over-reliance, but no quantitative results isolate its contribution to handling medical terminology or multi-step reasoning in the reported VQA tasks.
Authors: We will expand §3.3 with targeted ablations that isolate response shuffling. Specifically, we will report accuracy deltas with and without shuffling on each dataset, together with qualitative examples highlighting improvements in medical term accuracy and multi-step reasoning chains, thereby quantifying its contribution beyond the overall SOTA numbers. revision: yes
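A minimal sketch of one way such a shuffling strategy could be realized (sentence-level permutation of the target description); the separator choice and the function itself are assumptions for illustration, not the paper's released code:

import random

def shuffle_response(description: str) -> str:
    # Permute the sentences of a ground-truth description so the model
    # cannot lean on a fixed left-to-right ordering of previous words.
    # Splitting on "." is an assumed separator; the paper's ablation
    # (excerpt [8] below) indicates the separator choice matters.
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

# Illustrative use on a synthetic radiology-style description.
desc = "The lungs are clear. No pleural effusion is seen. Heart size is normal."
print(shuffle_response(desc))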
Circularity Check
No circularity: method components are independent additions
full rationale
The paper presents the momentum proxy instruction and response shuffling as explicit engineering choices added to image-description pair fine-tuning; these are not defined in terms of the target outputs or fitted from the same evaluation data. No equations reduce any claimed prediction to an input by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the SOTA accuracy statements rest on reported empirical results across the listed datasets rather than on any renaming or ansatz smuggling. The derivation chain therefore remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pre-trained LVLMs possess instruction-following capability that can be preserved via a momentum proxy instruction during fine-tuning on image-description pairs.
invented entities (1)
- momentum proxy instruction: no independent evidence
Reference graph
Works this paper leans on
-
[1]
integrated CLIP’s vision encoder [15] with LLaMA [11], and was fine-tuned on an instruction dataset generated by language-only GPT-4 [12]. Recently, LLaMA-3.2-Vision [16] extended LLaMA 3.1 by adapting cross-attention layers to integrate the image modality with language. Additionally, alternative LLMs have been integrated into the vision domain, such as Q...
-
[2]
and MedGemma [3] have been fine-tuned on carefully curated medical datasets from PMC and other publicly available sources, significantly enhancing their medical capabilities. One study [18] attempted instruction-free tuning by using language-only instruction-output pairs and image-caption pairs with a fixed set of instructions. However, none of the exis...
-
[3]
Proxy Instruction: Let the proxy instruction t = {t_1, . . . , t_N} be defined as a set of N continuous vectors, each with the same dimensionality as the word embeddings. The proxy instruction replaces the text instruction (e.g., a question) in the prompt X_p with learnable vectors during fine-tuning, in order to preserve the LVLM's pre-trained instruction-follow...
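Read literally, this defines the proxy instruction as a bank of N learnable embedding-dimension vectors, in the spirit of prefix-tuning (Li and Liang, 2021). A minimal sketch of that object, with illustrative names and dimensions rather than the paper's actual hyperparameters:

import torch
import torch.nn as nn

class ProxyInstruction(nn.Module):
    # N learnable continuous vectors standing in for a text instruction.
    # Each vector shares the word-embedding dimensionality d, so the bank
    # can occupy the instruction slot of the prompt X_p during fine-tuning.
    def __init__(self, n_vectors: int = 16, embed_dim: int = 4096, sigma: float = 0.02):
        super().__init__()
        # t ~ N(0, sigma^2), matching the initialization in the Algorithm 1 excerpt.
        self.t = nn.Parameter(torch.randn(n_vectors, embed_dim) * sigma)

    def forward(self, batch_size: int) -> torch.Tensor:
        # (batch, N, d): one copy of the proxy per prompt in the batch.
        return self.t.unsqueeze(0).expand(batch_size, -1, -1)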
-
[4]
Momentum Proxy Instruction: Although the optimized instruction t is well aligned with its corresponding descriptions, t is replaced with a conversational text instruction at inference. This can lead to overfitting and over-reliance on proxy instructions, thereby degrading the inference performance of a fine-tuned LVLM. To mitigate this issue, we aim not t...
-
[5]
Notably, to improve efficiency, we first optimize t while keeping g frozen as a warm-up stage, and then use the fine-tuned t to initialize t̄, rather than using random initialization. Algorithm 1 (instruction-free tuning process): 1: Input: vision encoder g, language model f, ground-truth description y, learning rate η, momentum coefficient α; 2: t ← N(0, σ²); 3: while not...
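A hedged reconstruction of that loop; the two-objective wiring, optimizer choices, and model call signature are inferred from the excerpt and from momentum methods such as MoCo (He et al., 2020), not taken from the paper's code:

import torch

# Assumed interfaces: model(feats, instr, y) returns a language-modeling loss
# for the prompt built from image features, an instruction embedding, and the
# ground-truth description y; g is the frozen vision encoder; proxy is the
# ProxyInstruction module sketched above.
alpha, eta, warmup_steps = 0.999, 1e-4, 500
opt_t = torch.optim.SGD([proxy.t], lr=eta)                 # updates t
opt_model = torch.optim.AdamW(model.parameters(), lr=eta)  # updates the LVLM
t_bar = None                                               # momentum proxy t̄

for step, (image, y) in enumerate(loader):
    with torch.no_grad():
        feats = g(image)  # g stays frozen throughout
    if step < warmup_steps:
        # Warm-up: optimize t alone before touching the LVLM.
        loss = model(feats, proxy.t, y)
        loss.backward()
        opt_t.step()
        opt_t.zero_grad(); opt_model.zero_grad()
        continue
    if t_bar is None:
        t_bar = proxy.t.detach().clone()  # initialize t̄ from the warmed-up t
    with torch.no_grad():
        # Assumed EMA update: t̄ ← α·t̄ + (1 − α)·t; no gradient flows into t̄.
        t_bar.mul_(alpha).add_(proxy.t.detach(), alpha=1 - alpha)
    # Keep t aligned with descriptions while the LVLM is fine-tuned against
    # the slower-moving momentum proxy t̄.
    loss = model(feats, proxy.t, y) + model(feats, t_bar, y)
    loss.backward()
    opt_t.step(); opt_model.step()
    opt_t.zero_grad(); opt_model.zero_grad()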
-
[6]
Main Results: We first compared our method against other LVLMs without fine-tuning (w/o FT) on the SKINCON, WBCAtt, and CBIS datasets, including general LVLMs such as LLaMA-3.2-11B-Vision-Instruct [16] and Qwen2.5-VL-3B-Instruct [17], as well as medical LVLMs such as PubMedVision-7B-Qwen2.5VL [4] and MedGemma-4B-it [3]. We then compared our method with f...
-
[7]
Comparison with Instruction-Free Tuning Variants: We compared our method with two instruction-free tuning variants. Table I: Multiple-choice VQA accuracy on the SKINCON, WBCAtt, and CBIS datasets. We compared our method (InstFree) with BLIP-2 [32], MedGemma-4B [3], PubMedVision-7B [4], Qwen2.5-VL-3B [17], and LLaMA-3.2-11B-Vision [16]. FT denotes fine-tunin...
-
[8]
…”) to demonstrate that accuracy degrades when response shuffling uses incorrect separators such as “…
Ablation on Response Shuffling: We conducted ablation studies on response shuffling using the SKINCON, WBCAtt, and CBIS datasets. First, we compared InstFree w/ Bal (balanced sampling) to investigate whether the issue originates from the model overfitting to previous word correlations or recurring response patterns. We calculate the frequency of each wor...
-
[9]
Discussion of Misalignment: As shown in Table IV (FT w/ Rand), fine-tuning the model to generate consistent responses across a broad range of instructions leads to a significant degradation of its pre-trained instruction-following capability. This degradation appears to stem from a misalignment between the fine-tuning dataset and the pre-training data, as...
-
[10]
Ablation on Coefficients and Instruction Scale: We conducted ablation studies on the momentum proxy instruction using the SKINCON, WBCAtt, and CBIS datasets by varying the momentum coefficient and the number of instruction vectors to identify optimal hyperparameters. For the momentum coefficient α, we experimented with values of 0.9, 0.99, 0.999, and 0...
-
[11]
Llava-med: Training a large language-and-vision assistant for biomedicine in one day,
C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” Advances in Neural Information Processing Systems, vol. 36, pp. 28541–28564, 2023
work page 2023
-
[12]
Lima: Less is more for alignment,
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 55006–55021, 2023
work page 2023
-
[13]
A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau et al., “Medgemma technical report,” arXiv preprint arXiv:2507.05201, 2025
work page 2025
-
[14]
Towards injecting medical visual knowledge into multimodal llms at scale,
J. Chen, C. Gui, R. Ouyang, A. Gao, S. Chen, G. Chen, X. Wang, Z. Cai, K. Ji, X. Wan et al., “Towards injecting medical visual knowledge into multimodal llms at scale,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 7346–7370
work page 2024
-
[15]
S. Moon, S. Pakhomov, N. Liu, J. O. Ryan, and G. B. Melton, “A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources,” Journal of the American Medical Informatics Association, vol. 21, no. 2, pp. 299–307, 2014
work page 2014
-
[16]
Challenges in clinical natural language processing for automated disorder normalization,
R. Leaman, R. Khare, and Z. Lu, “Challenges in clinical natural language processing for automated disorder normalization,” Journal of Biomedical Informatics, vol. 57, pp. 28–37, 2015
work page 2015
-
[17]
Toward best practices in radiology reporting,
C. E. Kahn Jr, C. P. Langlotz, E. S. Burnside, J. A. Carrino, D. S. Channin, D. M. Hovsepian, and D. L. Rubin, “Toward best practices in radiology reporting,” Radiology, vol. 252, no. 3, pp. 852–856, 2009
work page 2009
-
[18]
Multi-modal hallucination control by visual information grounding,
A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto, “Multi-modal hallucination control by visual information grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14303–14312
work page 2024
-
[19]
Cross-task generalization via natural language crowdsourcing instructions,
S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task generalization via natural language crowdsourcing instructions,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3470–3487
work page 2022
-
[20]
Alpaca: A strong, replicable instruction-following model,
R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpaca: A strong, replicable instruction-following model,” Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, vol. 3, no. 6, p. 7, 2023
work page 2023
-
[21]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[22]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
work page 2023
-
[23]
Self-instruct: Aligning language models with self-generated instructions,
Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13484–13508
work page 2023
-
[24]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023
work page 2023
-
[25]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
work page 2021
-
[26]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
work page 2024
-
[27]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025
work page 2025
-
[28]
Vift: Towards visual instruction-free fine-tuning for large vision-language models,
Z. Liu, K. Zhou, X. Zhao, D. Gao, Y. Li, and J.-R. Wen, “Vift: Towards visual instruction-free fine-tuning for large vision-language models,” in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 10341–10366
work page 2025
-
[29]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021
work page 2021
-
[30]
Flamingo: a visual language model for few-shot learning,
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022
work page 2022
-
[31]
torchtune: Pytorch’s finetuning library,
torchtune, “torchtune: Pytorch’s finetuning library,” https://github.com/pytorch/torchtune, 2024
work page 2024
-
[32]
Prefix-tuning: Optimizing continuous prompts for generation,
X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597
work page 2021
-
[33]
Momentum contrast for unsupervised visual representation learning,
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738
work page 2020
-
[34]
R. Daneshjou, M. Yuksekgonul, Z. R. Cai, R. Novoa, and J. Y. Zou, “Skincon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis,” Advances in Neural Information Processing Systems, vol. 35, pp. 18157–18167, 2022
work page 2022
-
[35]
M. Groh, C. Harris, L. Soenksen, F. Lau, R. Han, A. Kim, A. Koochek, and O. Badri, “Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1820–1828
work page 2021
-
[36]
Wbcatt: A white blood cell dataset annotated with detailed morphological attributes,
S. Tsutsui, W. Pang, and B. Wen, “Wbcatt: A white blood cell dataset annotated with detailed morphological attributes,” Advances in Neural Information Processing Systems, vol. 36, pp. 50796–50824, 2023
work page 2023
-
[37]
A curated mammography data set for use in computer-aided detection and diagnosis research,
R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L. Rubin, “A curated mammography data set for use in computer-aided detection and diagnosis research,” Scientific Data, vol. 4, no. 1, pp. 1–9, 2017
work page 2017
-
[38]
Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports,
A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng, “Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports,” Scientific Data, vol. 6, no. 1, p. 317, 2019
work page 2019
-
[39]
J. M. Zambrano Chaves, S.-C. Huang, Y. Xu, H. Xu, N. Usuyama, S. Zhang, F. Wang, Y. Xie, M. Khademi, Z. Yang et al., “A clinically accessible small multimodal radiology model and evaluation metric for chest x-ray findings,” Nature Communications, vol. 16, no. 1, p. 3108, 2025
work page 2025
-
[40]
N. Gaggion, C. Mosquera, L. Mansilla, J. M. Saidman, M. Aineseder, D. H. Milone, and E. Ferrante, “Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images,” Scientific Data, vol. 11, no. 1, p. 511, 2024
work page 2024
-
[41]
Interpretable medical image visual question answering via multi-modal relationship graph learning,
X. Hu, L. Gu, K. Kobayashi, L. Liu, M. Zhang, T. Harada, R. M. Summers, and Y. Zhu, “Interpretable medical image visual question answering via multi-modal relationship graph learning,” Medical Image Analysis, vol. 97, p. 103279, 2024
work page 2024
-
[42]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742
work page 2023
-
[43]
S. Goyal, C. Baek, J. Z. Kolter, and A. Raghunathan, “Context-parametric inversion: Why instruction finetuning may not actually improve context reliance,” in The Thirteenth International Conference on Learning Representations, 2025
work page 2025