pith. machine review for the scientific record.

arxiv: 2604.02748 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 Lean theorem links

Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords brain MRI · large language model · visual instruction tuning · image segmentation · image translation · visual question answering · medical report generation

The pith

One finetuned language model outperforms specialized models on four brain MRI tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops LLaBIT, a large language model finetuned on visual instructions for brain magnetic resonance imaging tasks. It tackles spatial detail loss in turning images into tokens by reusing the encoder's feature maps and creates more training examples by having LLMs write text descriptions under strict rules. Tested on five different brain MRI datasets, the model performs well on generating radiology reports, answering questions about images, segmenting structures, and translating between image types, and it does better than models made just for one of those jobs. A reader would care if this means future medical AI can use fewer, more flexible systems instead of many narrow ones.

Core claim

LLaBIT extends the visual reasoning of LLMs to clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons.

What carries the argument

Feature map reuse from the image encoder to preserve spatial information during tokenization, combined with LLM-generated text data for training augmentation.
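
A minimal sketch of that load-bearing mechanism, assuming a VQ-GAN-style encoder/decoder with zero-initialized skip connections as Figures 4 and 5 describe. Additive prompt modulation stands in here for the cross-attention the paper uses, and every name is illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class ZeroConvSkip(nn.Module):
    """Skip connection whose contribution starts at zero and grows during training."""
    def __init__(self, channels: int, prompt_dim: int):
        super().__init__()
        # Project the text-prompt feature P so it can modulate the skip feature
        # (the paper conditions via a BiomedCLIP text encoder and cross-attention).
        self.prompt_proj = nn.Linear(prompt_dim, channels)
        # Zero-initialized 1x1 convolution: at step 0 the decoder sees exactly
        # what it saw without skips, so pretrained behavior is preserved.
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, f_enc: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # f_enc: (B, C, H, W) encoder feature map; prompt: (B, prompt_dim).
        p = self.prompt_proj(prompt)[:, :, None, None]  # broadcast over H, W
        return self.zero_conv(f_enc + p)                # skip feature, in the spirit of the paper's eq. (6)

def decode_stage(dec_feat: torch.Tensor, f_enc: torch.Tensor,
                 skip: ZeroConvSkip, prompt: torch.Tensor) -> torch.Tensor:
    """Decoder stage that reuses the frozen encoder's feature map via the skip."""
    return dec_feat + skip(f_enc, prompt)
```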

If this is right

  • The model achieves superior results on report generation, visual question answering, segmentation, and translation compared to prior methods.
  • It outperforms dedicated task-specific models in head-to-head tests on brain MRI data.
  • Reusing encoder feature maps enables accurate spatial tasks without additional architectural changes.
  • LLM-based text generation effectively expands the available training data for medical vision-language tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adapting this single-model approach to other medical imaging areas could simplify analysis tools across healthcare.
  • Hospitals might reduce the number of AI systems they maintain by adopting versatile models like this.
  • The technique of reusing features could help other vision-language models that suffer from tokenization losses.
  • Applying the model to real patient data with noise or artifacts would test its practical reliability beyond controlled datasets.

Load-bearing premise

Reusing feature maps from the image encoder is enough to avoid losing the spatial details needed for good segmentation and translation results.

What would settle it

If experiments on the same datasets show that LLaBIT's segmentation accuracy or translation quality falls below that of a specialized segmentation or translation model, the claim of outperformance would not hold.

Figures

Figures reproduced from arXiv: 2604.02748 by Hyunjin Park, Jonghun Kim, Sinyoung Ra.

Figure 1
Figure 1: Example of LLaBIT performing versatile tasks on brain MR images. LLaBIT supports report generation and image-to-image tasks. view at source ↗
Figure 2
Figure 2: Text data generation with LLMs on a dataset with only images. Images and captions are processed by LLMs with strict predefined instructions and few-shot samples selected by clinicians to generate reports and VQA results. The output of each model is accepted or rejected using GPT-4o, and the final report and VQA are regenerated based on this feedback. view at source ↗
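
A hedged sketch of the accept/reject loop this caption describes. The helper calls are hypothetical stand-ins for whatever LLM APIs the authors used; only the loop structure follows the figure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    accepted: bool
    feedback: str = ""

# Hypothetical stand-ins for the two models in Figure 2; not real APIs.
def call_generator(prompt: str) -> str: ...                       # LLM drafting report + VQA
def call_reviewer(candidate: str, caption: str) -> Verdict: ...   # GPT-4o accept/reject gate

def generate_text_data(caption: str, instructions: str, few_shot: str,
                       max_rounds: int = 3) -> Optional[str]:
    """Draft a report/VQA pair under strict instructions and clinician-chosen
    few-shot samples, then regenerate with reviewer feedback until accepted."""
    feedback = ""
    for _ in range(max_rounds):
        prompt = "\n\n".join([instructions, few_shot, caption, feedback])
        candidate = call_generator(prompt)
        verdict = call_reviewer(candidate, caption)
        if verdict.accepted:
            return candidate
        feedback = "Reviewer feedback: " + verdict.feedback
    return None  # discard samples that never pass review
```
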
Figure 3
Figure 3: Instruction tuning pipeline. Both text and images are tokenized and fed into the LLM, which can generate either text or image tokens as output. The LLM's vocabulary is extended to include image tokens in addition to text tokens. The instruction is provided to the LLM as text tokens. The image is converted into quantized tokens using a VQ encoder and fed into the LLM along with an <input> token. view at source ↗
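
For orientation, a sketch of how an LLM's vocabulary can be extended with quantized image tokens as the caption describes. The base checkpoint, codebook size, and token names are assumptions for illustration, not details from the paper.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "meta-llama/Llama-2-7b-hf"  # illustrative; the paper's base LLM may differ
CODEBOOK_SIZE = 8192               # assumed VQ codebook size

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# One new token per VQ codebook entry, plus the <input> marker from the figure.
image_tokens = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)]
tokenizer.add_tokens(image_tokens + ["<input>"])
model.resize_token_embeddings(len(tokenizer))

def to_sequence(instruction_ids: list[int], vq_indices: list[int]) -> list[int]:
    """Interleave text-instruction ids with image-token ids, as in Figure 3."""
    img_ids = tokenizer.convert_tokens_to_ids([f"<img_{i}>" for i in vq_indices])
    marker = tokenizer.convert_tokens_to_ids("<input>")
    return instruction_ids + [marker] + img_ids
```
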
Figure 4
Figure 4: Fine-tuning of VQ-GAN with zero skip connection. The skip connection is fine-tuned while freezing the image encoder and decoder. A zero convolution block is adopted, using a BiomedCLIP text encoder and prompt tuning to flexibly adapt to the target. view at source ↗
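
The objective that accompanies this figure in the paper is the instruction-tuning loss over a sequence of length L; the first line is the form stated in the paper, and the token-level expansion below it is the usual autoregressive reading, added here as an assumption.

```latex
\mathcal{L}_{\text{instruct}}
  = -\log p\!\left(X_{\text{response}} \mid X_{\text{img}}, X_{\text{instruction}}\right)
  = -\sum_{t=1}^{L} \log p\!\left(x_t \mid X_{\text{img}}, X_{\text{instruction}}, x_{<t}\right)
```
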
Figure 5
Figure 5: Loss functions for image-to-image tasks. (a) The translation task is trained using reconstruction loss. (b) The segmentation task is trained using Dice loss with an additional layer. view at source ↗
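
The skip feature behind these losses is computed, per the paper's equation (6), from the i-th encoder feature and a prompt feature P produced by the text encoder. The Dice loss is shown in its standard soft form for orientation; the paper's exact variant may differ.

```latex
f_i^{\text{skip}} = \mathrm{ZeroConv}\!\left(f_i^{\text{enc}},\, P\right) \quad (6)

\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_{v} p_v\, g_v}{\sum_{v} p_v + \sum_{v} g_v}
```
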
Figure 6
Figure 6: Examples of report generation and VQA from various MLLMs. The highlighted text reflects the unique features of each image. For comparison, each image displays its modality, plane, and abnormality information at the top. Top: an example of report generation; Bottom: an example of VQA. view at source ↗
Figure 8
Figure 8: T1 → T2 translation results. Top: UPENN-GBM. Bottom: IXI. view at source ↗
Figure 9
Figure 9: Comparison of two initialization methods in the skip connection block for T2 → T1ce translation. With Kaiming initialization, the image degrades rapidly early on. Zero initialization leads to a gradual increase in detail. view at source ↗
Figure 10
Figure 10: The formats of simple caption for LLM to generate text data. view at source ↗
Figure 11
Figure 11: The prompt for report data generation. view at source ↗
Figure 12
Figure 12: The prompt for VQA data generation. view at source ↗
Figure 13
Figure 13: The prompt for data evaluation. view at source ↗
Figure 14
Figure 14: Instruction list for report generation task. view at source ↗
Figure 15
Figure 15: Instruction list for segmentation task. view at source ↗
Figure 16
Figure 16: Instruction list for image translation task. view at source ↗
read the original abstract

LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text descriptions to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LLaBIT, a visual instruction-finetuned large language model for brain MRI that handles four tasks—report generation, visual question answering, image segmentation, and image translation—across five datasets. It augments limited paired data via LLM-generated text under strict instructions and reuses feature maps from the image encoder to reduce spatial degradation during tokenization. The central claim is that this unified model achieves superior performance over specialized task-specific models in direct comparisons.

Significance. If the performance claims hold under rigorous evaluation, the work would be significant for demonstrating a single versatile model that unifies clinically important brain MRI tasks, potentially reducing the need for separate specialized networks and improving data efficiency through LLM-based text augmentation.

major comments (2)
  1. [Abstract] The headline claim that the model 'outperformed specialized, task-specific models in direct comparisons' is unsupported by quantitative metrics, error bars, dataset sizes, or ablation results, leaving the central efficacy assertion unverifiable from the provided text.
  2. [Method (feature reuse description)] The feature-reuse mechanism for mitigating tokenization-induced spatial loss (described as reusing image-encoder feature maps) is load-bearing for the segmentation and translation results, yet the manuscript supplies no ablation removing this path, no spatial-fidelity metrics such as reconstruction PSNR at native resolution, and no boundary-precision evaluation.
minor comments (1)
  1. [Abstract] The abstract states evaluation on 'five brain MRI datasets' but provides no names, sizes, or splits; this information should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation of results and methods.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that the model 'outperformed specialized, task-specific models in direct comparisons' is unsupported by quantitative metrics, error bars, dataset sizes, or ablation results, leaving the central efficacy assertion unverifiable from the provided text.

    Authors: We agree that the abstract should include key quantitative support for the central claim to allow immediate verification. The full manuscript contains detailed tables with metrics (e.g., Dice scores, BLEU, PSNR), standard deviations across runs, dataset sizes, and ablation studies in Sections 4 and 5. We have revised the abstract to concisely report representative results, including average improvements over task-specific baselines with error bars and dataset references. revision: yes

  2. Referee: [Method (feature reuse description)] The feature-reuse mechanism for mitigating tokenization-induced spatial loss (described as reusing image-encoder feature maps) is load-bearing for the segmentation and translation results, yet the manuscript supplies no ablation removing this path, no spatial-fidelity metrics such as reconstruction PSNR at native resolution, and no boundary-precision evaluation.

    Authors: The feature-reuse path is indeed critical for spatial tasks. We have added an ablation study in the revised manuscript that removes this mechanism and reports the resulting performance drop on segmentation and translation. We now also include spatial-fidelity metrics (native-resolution PSNR and SSIM for translation) and boundary-precision metrics (Hausdorff distance and surface Dice for segmentation) in the evaluation tables. revision: yes
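
For concreteness, minimal reference implementations of two metrics named in this exchange, in their standard textbook forms; these are not the authors' evaluation code.

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) on binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

def psnr(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between an output x and a reference y."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))
```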

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper presents an empirical ML method (LLaBIT) for multi-task brain MRI processing via visual instruction fine-tuning, with a feature-reuse mechanism described to address tokenization loss. No equations, derivations, or predictions are offered that reduce to fitted parameters or inputs by construction. Performance claims rest on direct evaluations across external datasets rather than on self-referential loops, load-bearing self-citations, or ansatzes smuggled in via prior work. The assessed score of 2.0 is consistent with the absence of any load-bearing circular step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions about LLM fine-tuning effectiveness and the utility of feature map reuse for preserving spatial information; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption LLMs can be effectively instruction-finetuned for vision-language tasks in the medical imaging domain
    Invoked when extending visual reasoning of LLMs to clinical tasks like segmentation and translation.

pith-pipeline@v0.9.0 · 5520 in / 1142 out tokens · 34239 ms · 2026-05-13T20:35:24.076652+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 7 internal anchors

  1. [1]

    GPT-4 technical report

    Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (2022)

  3. [3]

    M3D: Advancing 3D medical image analysis with multi-modal large language models

    Bai, F., et al.: M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578 (2024)

  4. [4]

    The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification

    Baid, U., et al.: The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)

  5. [5]

    Multi-parametric magnetic resonance imaging (mpMRI) scans for de novo glioblastoma (GBM) patients from the University of Pennsylvania Health System (UPENN-GBM)

    Bakas, S., et al.: Multi-parametric magnetic resonance imaging (mpMRI) scans for de novo glioblastoma (GBM) patients from the University of Pennsylvania Health System (UPENN-GBM). The Cancer Imaging Archive (2021)

  6. [6]

    METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

    Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics (Jun 2005)

  7. [7]

    Comparison of CT and MR in 400 patients with suspected disease of the brain and cervical spinal cord

    Bradley Jr, W.G., et al.: Comparison of CT and MR in 400 patients with suspected disease of the brain and cervical spinal cord. Radiology 152(3), 695–702 (1984)

  8. [8]

    Language models are few-shot learners

    Brown, T., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)

  9. [9]

    Update on brain tumor imaging: from anatomy to physiology

    Cha, S.: Update on brain tumor imaging: from anatomy to physiology. American Journal of Neuroradiology 27(3), 475–487 (2006)

  10. [10]

    Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation

    Chaves, J.M.Z., et al.: Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. arXiv preprint arXiv:2403.08002 (2024)

  11. [11]

    Free Dolly: Introducing the world's first truly open instruction-tuned LLM

    Conover, M., et al.: Free Dolly: Introducing the world's first truly open instruction-tuned LLM (2023)

  12. [12]

    Biomedical visual instruction tuning with clinician preference alignment

    Cui, H., et al.: Biomedical visual instruction tuning with clinician preference alignment. In: The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2024)

  13. [13]

    Tumor detection by nuclear magnetic resonance

    Damadian, R.: Tumor detection by nuclear magnetic resonance. Science 171(3976), 1151–1153 (1971)

  14. [14]

    Introducing the world's first truly open instruction-tuned LLM

    Dolly, F.: Introducing the world's first truly open instruction-tuned LLM. databricks.com (2023)

  15. [15]

    Improving factuality and reasoning in language models through multiagent debate

    Du, Y., et al.: Improving factuality and reasoning in language models through multiagent debate. In: Forty-first International Conference on Machine Learning (2023)

  16. [16]

    SegVol: Universal and interactive volumetric medical image segmentation

    Du, Y., et al.: SegVol: Universal and interactive volumetric medical image segmentation. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

  17. [17]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Elfwing, S., et al.: Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks (2018)

  18. [18]

    Consensus recommendations for a standardized brain tumor imaging protocol in clinical trials

    Ellingson, B.M., et al.: Consensus recommendations for a standardized brain tumor imaging protocol in clinical trials. Neuro-Oncology 17(9), 1188–1198 (2015)

  19. [19]

    Taming transformers for high-resolution image synthesis

    Esser, P., et al.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12873–12883 (2021)

  20. [20]

    The Llama 3 herd of models

    Grattafiori, A., et al.: The Llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783

  21. [21]

    Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification

    He, K., et al.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)

  22. [22]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Heusel, M., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)

  23. [23]

    Image-to-image translation with conditional adversarial networks

    Isola, P., et al.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1125–1134 (2017)

  24. [24]

    Visual prompt tuning

    Jia, M., et al.: Visual prompt tuning. In: European Conference on Computer Vision. pp. 709–727. Springer (2022)

  25. [25]

    Adaptive latent diffusion model for 3D medical image to image translation: multi-modal magnetic resonance imaging study

    Kim, J., Park, H.: Adaptive latent diffusion model for 3D medical image to image translation: multi-modal magnetic resonance imaging study. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2024)

  26. [26]

    Visual prompt tuning for task-flexible medical image synthesis

    Kim, J., Park, H.: Visual prompt tuning for task-flexible medical image synthesis. Computer Methods and Programs in Biomedicine, p. 109244 (2026)

  27. [27]

    Enhancing intracranial vessel segmentation using diffusion models without manual annotation for 3D time-of-flight magnetic resonance angiography

    Kim, J., et al.: Enhancing intracranial vessel segmentation using diffusion models without manual annotation for 3D time-of-flight magnetic resonance angiography. Computerized Medical Imaging and Graphics, p. 102651 (2025)

  28. [28]

    Weakly-supervised segmentation using sparse single point annotations for lumen and wall of carotid arteries in 3D MRI

    Kim, J., et al.: Weakly-supervised segmentation using sparse single point annotations for lumen and wall of carotid arteries in 3D MRI. Computer Methods and Programs in Biomedicine 269, 108881 (2025)

  29. [29]

    Adam: A method for stochastic optimization

    Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations (2015)

  30. [30]

    Generating images with multimodal language models

    Koh, J.Y., et al.: Generating images with multimodal language models. Advances in Neural Information Processing Systems 36, 21487–21506 (2023)

  31. [31]

    Diffusion-based image translation using disentangled style and content representation

    Kwon, G., Ye, J.C.: Diffusion-based image translation using disentangled style and content representation. In: The Eleventh International Conference on Learning Representations (2023)

  32. [32]

    Leveraging segmentation-guided spatial feature embedding for overall survival prediction in glioblastoma with multimodal magnetic resonance imaging

    Kwon, J., et al.: Leveraging segmentation-guided spatial feature embedding for overall survival prediction in glioblastoma with multimodal magnetic resonance imaging. Computer Methods and Programs in Biomedicine 255, 108338 (2024)

  33. [33]

    Blood pressure assisted cerebral microbleed segmentation via meta-matching

    Kwon, J., et al.: Blood pressure assisted cerebral microbleed segmentation via meta-matching. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 77–86. Springer (2025)

  34. [34]

    The ASNR-MICCAI brain tumor segmentation (BraTS) challenge 2023: Intracranial meningioma

    LaBella, D., et al.: The ASNR-MICCAI brain tumor segmentation (BraTS) challenge 2023: Intracranial meningioma (2023)

  35. [35]

    CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images

    Lee, S., et al.: CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. European Radiology, pp. 1–13 (2025)

  36. [36]

    LLM-CXR: Instruction-finetuned LLM for CXR image understanding and generation

    Lee, S., Kim, W.J., Chang, J., Ye, J.C.: LLM-CXR: Instruction-finetuned LLM for CXR image understanding and generation. In: The Twelfth International Conference on Learning Representations (2024)

  37. [37]

    LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day

    Li, C., et al.: LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems (2024)

  38. [38]

    A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms

    Liew, S.L., et al.: A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms. Scientific Data 9(1), 320 (2022)

  39. [39]

    ROUGE: A package for automatic evaluation of summaries

    Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. Association for Computational Linguistics (2004)

  40. [40]

    Visual instruction tuning

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)

  41. [41]

    Decoupled weight decay regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)

  42. [42]

    Med-Flamingo: a multimodal medical few-shot learner

    Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-Flamingo: a multimodal medical few-shot learner. In: Machine Learning for Health (ML4H). pp. 353–367. PMLR (2023)

  43. [43]

    Enhancing radiomics features via a large language model for classifying benign and malignant breast tumors in mammography

    Ra, S., et al.: Enhancing radiomics features via a large language model for classifying benign and malignant breast tumors in mammography. Computer Methods and Programs in Biomedicine 265, 108765 (2025)

  44. [44]

    Improving language understanding by generative pre-training

    Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training (2018)

  45. [45]

    Learning transferable visual models from natural language supervision

    Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. PMLR (2021)

  46. [46]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., et al.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

  47. [47]

    U-Net: Convolutional networks for biomedical image segmentation

    Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

  48. [48]

    Stanford Alpaca: An instruction-following LLaMA model

    Taori, R., et al.: Stanford Alpaca: An instruction-following LLaMA model (2023)

  49. [49]

    Gemini: A family of highly capable multimodal models

    Team, G., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  50. [50]

    XrayGPT: Chest radiographs summarization using medical vision-language models

    Thawkar, O., et al.: XrayGPT: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971 (2023)

  51. [51]

    Llama 2: Open foundation and fine-tuned chat models

    Touvron, H., et al.: Llama 2: Open foundation and fine-tuned chat models (2023), https://arxiv.org/abs/2307.09288

  52. [52]

    LLaMA: Open and efficient foundation language models

    Touvron, H., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  53. [53]

    Neural discrete representation learning

    Van Den Oord, A., et al.: Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017)

  54. [54]

    Finetuned language models are zero-shot learners

    Wei, J., et al.: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022)

  55. [55]

    Adding conditional control to text-to-image diffusion models

    Zhang, L., et al.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)

  56. [56]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Zhang, S., et al.: BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)

  57. [57]

    BERTScore: Evaluating text generation with BERT

    Zhang, T., et al.: BERTScore: Evaluating text generation with BERT. In: International Conference on Learning Representations (2020)

  58. [58]

    MiniGPT-4: Enhancing vision-language understanding with advanced large language models

    Zhu, D., et al.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In: The Twelfth International Conference on Learning Representations (2024)