Recognition: no theorem link
Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following
Pith reviewed 2026-05-15 07:56 UTC · model grok-4.3
The pith
A momentum proxy instruction enables fine-tuning of medical LVLMs using only image-description pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core discovery is that fine-tuning large vision language models on image-description pairs alone, guided by a momentum proxy instruction instead of curated instructions, allows the model to maintain its instruction-following capability from pre-training. This enables the model to respond effectively to medical domain instructions at inference time. The addition of a response shuffling strategy reduces dependency on sequential word predictions, leading to improved performance on visual question answering tasks in medical imaging.
What carries the argument
The momentum proxy instruction, which replaces explicit instructions by using a dynamically updated version derived from the model's own outputs to guide fine-tuning on image-description pairs.
If this is right
- Fine-tuning becomes feasible with readily available image-description pairs rather than expert-crafted instructions.
- The fine-tuned model responds flexibly to domain-specific instructions at test time despite their absence during training.
- Accuracy improves on multiple-choice VQA for SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets.
- Overall efficiency of adapting LVLMs to medical domains increases.
Where Pith is reading between the lines
- This approach may apply to other fields where expert-annotated instructions are costly or scarce.
- Testing on open-ended response tasks could reveal whether the proxy maintains performance beyond multiple-choice formats.
- Integrating this with parameter-efficient methods could further lower the resources needed for domain adaptation.
- The response shuffling might address similar over-reliance issues in other sequence models.
Load-bearing premise
The momentum proxy instruction successfully maintains the pre-trained instruction-following capability while ensuring that updated parameters remain effective when actual instructions are used at inference.
What would settle it
A controlled experiment in which models fine-tuned with the momentum proxy are then tested on entirely new medical instructions never reflected in any proxy form; failure to follow those instructions would disprove the preservation of capability.
original abstract
Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated image-instruction-output triplets. However, in the medical domain, constructing large-scale, high-quality instruction datasets is particularly challenging due to the need for specialized expert knowledge. To address this issue, we propose an instruction-free tuning approach that reduces reliance on handcrafted instructions, leveraging only image-description pairs for fine-tuning. Specifically, we introduce a momentum proxy instruction as a replacement for curated text instructions, which preserves the instruction-following capability of the pre-trained LVLM while promoting updates to parameters that remain valid during inference. Consequently, the fine-tuned LVLM can flexibly respond to domain-specific instructions, even though explicit instructions are absent during fine-tuning. Additionally, we incorporate a response shuffling strategy to mitigate the model's over-reliance on previous words, facilitating more effective fine-tuning. Our approach achieves state-of-the-art accuracy on multiple-choice visual question answering tasks across SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets, significantly enhancing the fine-tuning efficiency of LVLMs in medical domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an instruction-free tuning method for large vision-language models (LVLMs) in the medical domain. It fine-tunes models using only image-description pairs by introducing a momentum proxy instruction to substitute for curated instructions and a response shuffling strategy to reduce over-reliance on prior tokens. The central claim is that this preserves the pre-trained model's instruction-following capability, enabling flexible responses to domain-specific medical instructions at inference, and yields state-of-the-art accuracy on multiple-choice visual question answering across the SKINCON, WBCAtt, CBIS, and MIMIC-CXR datasets.
Significance. If the empirical results hold under rigorous verification, the approach could meaningfully reduce dependence on expert-annotated instruction data, a major bottleneck for medical LVLM adaptation. The momentum proxy and shuffling mechanisms offer concrete, potentially reusable ideas for efficient domain transfer. Credit is given for targeting a practical limitation and evaluating on four distinct medical VQA benchmarks.
major comments (3)
- [Method (momentum proxy definition)] The momentum proxy instruction is presented as preserving instruction-following for unseen medical prompts, yet the manuscript provides no derivation, equivalence proof, or ablation demonstrating that the proxy-generated signal aligns parameters with medical reasoning patterns absent from pre-training. This assumption is load-bearing for the instruction-free claim.
- [Experiments] The SOTA accuracy claim on SKINCON, WBCAtt, CBIS, and MIMIC-CXR is stated without reported baselines, ablation tables, error bars, or statistical significance tests. The experimental section must include these to substantiate the efficiency and performance gains.
- [§3.3 (response shuffling)] The response shuffling strategy is introduced to mitigate token over-reliance, but no quantitative results isolate its contribution to handling medical terminology or multi-step reasoning in the reported VQA tasks.
minor comments (2)
- [Method] Notation for the momentum update rule should be formalized with explicit equations to avoid ambiguity in how the proxy is computed from image-description pairs; one plausible form is sketched after this list.
- [Experiments] Dataset statistics (number of image-description pairs used per benchmark) and training hyperparameters are missing from the main text and should be added for reproducibility.
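For illustration, one plausible explicit form, assuming the exponential-moving-average update typical of momentum methods such as MoCo (He et al., 2020). The paper's excerpts name the coefficient α and the vectors t and t̄, but the full rule is truncated there, so this is a sketch rather than the authors' verified equation:

% Hedged sketch of a momentum proxy update; superscripts index optimization steps.
\bar{t}^{(k+1)} = \alpha\,\bar{t}^{(k)} + (1-\alpha)\,t^{(k+1)}, \qquad \alpha \in \{0.9,\ 0.99,\ 0.999\}

where t^(k) is the learnable proxy instruction after step k and t̄ is the slower-moving momentum proxy used in the forward pass.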
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our instruction-free tuning method for medical LVLMs. We address each major comment below and will revise the manuscript to strengthen the presentation of the momentum proxy, experimental results, and response shuffling analysis.
point-by-point responses
-
Referee: The momentum proxy instruction is presented as preserving instruction-following for unseen medical prompts, yet the manuscript provides no derivation, equivalence proof, or ablation demonstrating that the proxy-generated signal aligns parameters with medical reasoning patterns absent from pre-training. This assumption is load-bearing for the instruction-free claim.
Authors: We agree that a more rigorous justification is needed. The momentum proxy is formulated as an exponential moving average of prior model-generated instructions to dynamically simulate instruction-following signals during fine-tuning on image-description pairs. In the revision we will add the explicit mathematical definition, a brief motivation section explaining its alignment with pre-trained instruction patterns, and an ablation comparing performance with and without the proxy on held-out medical prompts. revision: yes
-
Referee: The SOTA accuracy claim on SKINCON, WBCAtt, CBIS, and MIMIC-CXR is stated without reported baselines, ablation tables, error bars, or statistical significance tests. The experimental section must include these to substantiate the efficiency and performance gains.
Authors: We acknowledge the need for stronger empirical support. The revised manuscript will include (i) comparisons against standard fine-tuning and recent medical LVLM baselines, (ii) full ablation tables for the momentum proxy and shuffling components, (iii) error bars computed over multiple random seeds, and (iv) paired statistical significance tests (e.g., McNemar or t-tests) for the reported accuracy improvements. revision: yes
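For concreteness, a minimal sketch of the kind of paired test named above (an exact two-sided McNemar test on per-question correctness); the arrays and numbers are illustrative, not results from the paper:

from math import comb

def mcnemar_exact(correct_a, correct_b):
    # Exact McNemar test on paired 0/1 correctness vectors.
    # b = questions only model A answers correctly, c = only model B;
    # under the null hypothesis the discordant count follows Bin(b + c, 0.5).
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # the two models disagree on no question
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative use: 1 = correct answer on a VQA item, 0 = incorrect.
ours     = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [1, 0, 0, 1, 0, 0, 1, 0]
print(f"McNemar p = {mcnemar_exact(ours, baseline):.3f}")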
-
Referee: The response shuffling strategy is introduced to mitigate token over-reliance, but no quantitative results isolate its contribution to handling medical terminology or multi-step reasoning in the reported VQA tasks.
Authors: We will expand §3.3 with targeted ablations that isolate response shuffling. Specifically, we will report accuracy deltas with and without shuffling on each dataset, together with qualitative examples highlighting improvements in medical term accuracy and multi-step reasoning chains, thereby quantifying its contribution beyond the overall SOTA numbers. revision: yes
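A minimal sketch of one way such a shuffling strategy could be realized (sentence-level permutation of the target description); the separator choice and the function itself are assumptions for illustration, not the paper's released code:

import random

def shuffle_response(description: str) -> str:
    # Permute the sentences of a ground-truth description so the model
    # cannot lean on a fixed left-to-right ordering of previous words.
    # Splitting on "." is an assumed separator; the paper's ablation
    # (excerpt [8] below) indicates the separator choice matters.
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

# Illustrative use on a synthetic radiology-style description.
desc = "The lungs are clear. No pleural effusion is seen. Heart size is normal."
print(shuffle_response(desc))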
Circularity Check
No circularity: method components are independent additions
full rationale
The paper presents the momentum proxy instruction and response shuffling as explicit engineering choices added to image-description pair fine-tuning; these are not defined in terms of the target outputs or fitted from the same evaluation data. No equations reduce any claimed prediction to an input by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the SOTA accuracy statements rest on reported empirical results across the listed datasets rather than on any renaming or ansatz smuggling. The derivation chain therefore remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pre-trained LVLMs possess instruction-following capability that can be preserved via a momentum proxy instruction during fine-tuning on image-description pairs.
invented entities (1)
- momentum proxy instruction: no independent evidence
Reference graph
Works this paper leans on
-
[1]
integrated CLIP’s vision encoder [15] with LLaMA [11], and was fine-tuned on an instruction dataset generated by language-only GPT-4 [12]. Recently, LLaMA-3.2-Vision [16] extended LLaMA 3.1 by adapting cross-attention layers to integrate the image modality with language. Additionally, alternative LLMs have been integrated into the vision domain, such as Q...
-
[2]
and MedGemma [3] have been fine-tuned on carefully curated medical datasets from PMC and other publicly available sources, significantly enhancing their medical capabilities. One study [18] attempted instruction-free tuning by using language-only instruction-output pairs and image-caption pairs with a fixed set of instructions. However, none of the exis...
-
[3]
Proxy Instruction: Let the proxy instruction t = {t_1, . . . , t_N} be defined as a set of N continuous vectors, each with the same dimensionality as the word embeddings. The proxy instruction replaces the text instruction (e.g., a question) in the prompt X_p with learnable vectors during fine-tuning, in order to preserve the LVLM's pre-trained instruction-follow...
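Read literally, this defines the proxy instruction as a bank of N learnable embedding-dimension vectors, in the spirit of prefix-tuning (Li and Liang, 2021). A minimal sketch of that object, with illustrative names and dimensions rather than the paper's actual hyperparameters:

import torch
import torch.nn as nn

class ProxyInstruction(nn.Module):
    # N learnable continuous vectors standing in for a text instruction.
    # Each vector shares the word-embedding dimensionality d, so the bank
    # can occupy the instruction slot of the prompt X_p during fine-tuning.
    def __init__(self, n_vectors: int = 16, embed_dim: int = 4096, sigma: float = 0.02):
        super().__init__()
        # t ~ N(0, sigma^2), matching the initialization in the Algorithm 1 excerpt.
        self.t = nn.Parameter(torch.randn(n_vectors, embed_dim) * sigma)

    def forward(self, batch_size: int) -> torch.Tensor:
        # (batch, N, d): one copy of the proxy per prompt in the batch.
        return self.t.unsqueeze(0).expand(batch_size, -1, -1)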
-
[4]
Momentum Proxy Instruction: Although the optimized instruction t is well aligned with its corresponding descriptions, t is replaced with a conversational text instruction at inference. This can lead to overfitting and over-reliance on proxy instructions, thereby degrading the inference performance of a fine-tuned LVLM. To mitigate this issue, we aim not t...
-
[5]
Notably, to improve efficiency, we first optimize t while keeping g frozen as a warm-up stage, and then use the fine-tuned t to initialize t̄, rather than using random initialization. Algorithm 1 (instruction-free tuning process): 1: Input: vision encoder g, language model f, ground-truth description y, learning rate η, momentum coefficient α; 2: t ← N(0, σ²); 3: while not...
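A hedged reconstruction of that loop; the two-objective wiring, optimizer choices, and model call signature are inferred from the excerpt and from momentum methods such as MoCo (He et al., 2020), not taken from the paper's code:

import torch

# Assumed interfaces: model(feats, instr, y) returns a language-modeling loss
# for the prompt built from image features, an instruction embedding, and the
# ground-truth description y; g is the frozen vision encoder; proxy is the
# ProxyInstruction module sketched above.
alpha, eta, warmup_steps = 0.999, 1e-4, 500
opt_t = torch.optim.SGD([proxy.t], lr=eta)                 # updates t
opt_model = torch.optim.AdamW(model.parameters(), lr=eta)  # updates the LVLM
t_bar = None                                               # momentum proxy t̄

for step, (image, y) in enumerate(loader):
    with torch.no_grad():
        feats = g(image)  # g stays frozen throughout
    if step < warmup_steps:
        # Warm-up: optimize t alone before touching the LVLM.
        loss = model(feats, proxy.t, y)
        loss.backward()
        opt_t.step()
        opt_t.zero_grad(); opt_model.zero_grad()
        continue
    if t_bar is None:
        t_bar = proxy.t.detach().clone()  # initialize t̄ from the warmed-up t
    with torch.no_grad():
        # Assumed EMA update: t̄ ← α·t̄ + (1 − α)·t; no gradient flows into t̄.
        t_bar.mul_(alpha).add_(proxy.t.detach(), alpha=1 - alpha)
    # Keep t aligned with descriptions while the LVLM is fine-tuned against
    # the slower-moving momentum proxy t̄.
    loss = model(feats, proxy.t, y) + model(feats, t_bar, y)
    loss.backward()
    opt_t.step(); opt_model.step()
    opt_t.zero_grad(); opt_model.zero_grad()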
-
[6]
Main Results: We first compared our method against other LVLMs without fine-tuning (w/o FT) on the SKINCON, WBCAtt, and CBIS datasets, including general LVLMs such as LLaMA-3.2-11B-Vision-Instruct [16] and Qwen2.5-VL-3B-Instruct [17], as well as medical LVLMs such as PubMedVision-7B-Qwen2.5VL [4] and MedGemma-4B-it [3]. We then compared our method with f...
-
[7]
Comparison with Instruction-Free Tuning Variants: We compared our method with two instruction-free tuning variants. Table I: Multiple-choice VQA accuracy on the SKINCON, WBCAtt, and CBIS datasets. We compared our method (InstFree) with BLIP-2 [32], MedGemma-4B [3], PubMedVision-7B [4], Qwen2.5-VL-3B [17], and LLaMA-3.2-11B-Vision [16]. FT denotes fine-tunin...
-
[8]
…”) to demonstrate that accuracy degrades when response shuffling uses incorrect separators such as “…
Ablation on Response Shuffling: We conducted ablation studies on response shuffling using the SKINCON, WBCAtt, and CBIS datasets. First, we compared InstFree w/ Bal (balanced sampling) to investigate whether the issue originates from the model overfitting to previous word correlations or recurring response patterns. We calculate the frequency of each wor...
-
[9]
Discussion of Misalignment: As shown in Table IV (FT w/ Rand), fine-tuning the model to generate consistent responses across a broad range of instructions leads to a significant degradation of its pre-trained instruction-following capability. This degradation appears to stem from a misalignment between the fine-tuning dataset and the pre-training data, as...
-
[10]
Ablation on Coefficients and Instruction Scale: We conducted ablation studies on the momentum proxy instruction using the SKINCON, WBCAtt, and CBIS datasets by varying the momentum coefficient and the number of instruction vectors to identify optimal hyperparameters. For the momentum coefficient α, we experimented with values of 0.9, 0.99, 0.999, and 0...
-
[11]
Llava-med: Training a large language-and-vision assistant for biomedicine in one day,
C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” Advances in Neural Information Processing Systems, vol. 36, pp. 28541–28564, 2023
work page 2023
-
[12]
Lima: Less is more for alignment,
C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 55006–55021, 2023
work page 2023
-
[13]
A. Sellergren, S. Kazemzadeh, T. Jaroensri, A. Kiraly, M. Traverse, T. Kohlberger, S. Xu, F. Jamil, C. Hughes, C. Lau et al., “Medgemma technical report,” arXiv preprint arXiv:2507.05201, 2025
work page 2025
-
[14]
Towards injecting medical visual knowledge into multimodal llms at scale,
J. Chen, C. Gui, R. Ouyang, A. Gao, S. Chen, G. Chen, X. Wang, Z. Cai, K. Ji, X. Wan et al., “Towards injecting medical visual knowledge into multimodal llms at scale,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 7346–7370
work page 2024
-
[15]
S. Moon, S. Pakhomov, N. Liu, J. O. Ryan, and G. B. Melton, “A sense inventory for clinical abbreviations and acronyms created using clinical notes and medical dictionary resources,” Journal of the American Medical Informatics Association, vol. 21, no. 2, pp. 299–307, 2014
work page 2014
-
[16]
Challenges in clinical natural language processing for automated disorder normalization,
R. Leaman, R. Khare, and Z. Lu, “Challenges in clinical natural language processing for automated disorder normalization,” Journal of Biomedical Informatics, vol. 57, pp. 28–37, 2015
work page 2015
-
[17]
Toward best practices in radiology reporting,
C. E. Kahn Jr, C. P. Langlotz, E. S. Burnside, J. A. Carrino, D. S. Channin, D. M. Hovsepian, and D. L. Rubin, “Toward best practices in radiology reporting,” Radiology, vol. 252, no. 3, pp. 852–856, 2009
work page 2009
-
[18]
Multi-modal hallucination control by visual information grounding,
A. Favero, L. Zancato, M. Trager, S. Choudhary, P. Perera, A. Achille, A. Swaminathan, and S. Soatto, “Multi-modal hallucination control by visual information grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14303–14312
work page 2024
-
[19]
Cross-task generalization via natural language crowdsourcing instructions,
S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task generalization via natural language crowdsourcing instructions,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3470–3487
work page 2022
-
[20]
Alpaca: A strong, replicable instruction-following model,
R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpaca: A strong, replicable instruction-following model,” Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, vol. 3, no. 6, p. 7, 2023
work page 2023
-
[21]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[22]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
work page 2023
-
[23]
Self-instruct: Aligning language models with self-generated instructions,
Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13484–13508
work page 2023
-
[24]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, pp. 34892–34916, 2023
work page 2023
-
[25]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763
work page 2021
-
[26]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
work page 2024
-
[27]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025
work page 2025
-
[28]
Vift: Towards visual instruction-free fine-tuning for large vision-language models,
Z. Liu, K. Zhou, X. Zhao, D. Gao, Y. Li, and J.-R. Wen, “Vift: Towards visual instruction-free fine-tuning for large vision-language models,” in Findings of the Association for Computational Linguistics: EMNLP 2025, 2025, pp. 10341–10366
work page 2025
-
[29]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021
work page 2021
-
[30]
Flamingo: a visual language model for few-shot learning,
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022
work page 2022
-
[31]
torchtune: Pytorch’s finetuning library,
torchtune, “torchtune: Pytorch’s finetuning library,” https://github.com/pytorch/torchtune, 2024
work page 2024
-
[32]
Prefix-tuning: Optimizing continuous prompts for generation,
X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597
work page 2021
-
[33]
Momentum contrast for unsupervised visual representation learning,
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738
work page 2020
-
[34]
R. Daneshjou, M. Yuksekgonul, Z. R. Cai, R. Novoa, and J. Y. Zou, “Skincon: A skin disease dataset densely annotated by domain experts for fine-grained debugging and analysis,” Advances in Neural Information Processing Systems, vol. 35, pp. 18157–18167, 2022
work page 2022
-
[35]
M. Groh, C. Harris, L. Soenksen, F. Lau, R. Han, A. Kim, A. Koochek, and O. Badri, “Evaluating deep neural networks trained on clinical images in dermatology with the Fitzpatrick 17k dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1820–1828
work page 2021
-
[36]
Wbcatt: A white blood cell dataset annotated with detailed morphological attributes,
S. Tsutsui, W. Pang, and B. Wen, “Wbcatt: A white blood cell dataset annotated with detailed morphological attributes,” Advances in Neural Information Processing Systems, vol. 36, pp. 50796–50824, 2023
work page 2023
-
[37]
A curated mammography data set for use in computer-aided detection and diagnosis research,
R. S. Lee, F. Gimenez, A. Hoogi, K. K. Miyake, M. Gorovoy, and D. L. Rubin, “A curated mammography data set for use in computer-aided detection and diagnosis research,” Scientific Data, vol. 4, no. 1, pp. 1–9, 2017
work page 2017
-
[38]
Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports,
A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng, “Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports,” Scientific Data, vol. 6, no. 1, p. 317, 2019
work page 2019
-
[39]
J. M. Zambrano Chaves, S.-C. Huang, Y. Xu, H. Xu, N. Usuyama, S. Zhang, F. Wang, Y. Xie, M. Khademi, Z. Yang et al., “A clinically accessible small multimodal radiology model and evaluation metric for chest x-ray findings,” Nature Communications, vol. 16, no. 1, p. 3108, 2025
work page 2025
-
[40]
N. Gaggion, C. Mosquera, L. Mansilla, J. M. Saidman, M. Aineseder, D. H. Milone, and E. Ferrante, “Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images,” Scientific Data, vol. 11, no. 1, p. 511, 2024
work page 2024
-
[41]
Interpretable medical image visual question answering via multi-modal relationship graph learning,
X. Hu, L. Gu, K. Kobayashi, L. Liu, M. Zhang, T. Harada, R. M. Summers, and Y. Zhu, “Interpretable medical image visual question answering via multi-modal relationship graph learning,” Medical Image Analysis, vol. 97, p. 103279, 2024
work page 2024
-
[42]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742
work page 2023
-
[43]
S. Goyal, C. Baek, J. Z. Kolter, and A. Raghunathan, “Context-parametric inversion: Why instruction finetuning may not actually improve context reliance,” in The Thirteenth International Conference on Learning Representations, 2025
work page 2025