Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
Pith reviewed 2026-05-20 05:33 UTC · model grok-4.3
The pith
Existing visual attribution methods often fail to identify the actual evidence used by LVLMs for chest X-ray reasoning, while MedFocus succeeds by localizing anatomical concepts and measuring their causal effects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that standard visual attribution methods do not reliably recover the visual evidence underlying LVLM predictions on chest X-rays. A causal evaluation framework filters the dataset to samples where counterfactual editing of expert-annotated regions demonstrably alters model outputs, exposing that most existing methods misalign with these causal regions. MedFocus corrects the mismatch by first localizing anatomical concepts through unbalanced optimal transport and then assessing their causal impact on model outputs with targeted interventions, thereby generating multi-level attributions that are more faithful to the model's internal reasoning.
What carries the argument
MedFocus, which localizes clinically meaningful anatomical regions via unbalanced optimal transport and quantifies their causal effects on LVLM outputs through targeted interventions.
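To make the localization step more concrete, the following is a minimal sketch of entropic unbalanced optimal transport with KL-relaxed marginals, solved by scaling iterations. It is not the authors' implementation: the cost matrix, the patch and concept counts, and the parameters `eps`, `rho`, and `n_iters` are all hypothetical values chosen for illustration.

```python
import numpy as np


def unbalanced_sinkhorn(cost, a, b, eps=0.1, rho=1.0, n_iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals (scaling iterations).

    cost: (n, m) cost matrix; a: (n,) source mass; b: (m,) target mass.
    eps is the entropic regularization, rho the marginal-relaxation strength.
    """
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)  # power -> 1 recovers balanced Sinkhorn
    for _ in range(n_iters):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]


# Hypothetical costs between 5 image patches (rows) and 2 concepts (columns).
# Patches 0-1 resemble concept 0, patches 2-3 resemble concept 1,
# and patch 4 is background that matches neither concept.
cost = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
a = np.full(5, 1.0 / 5.0)  # uniform mass over patches (assumption)
b = np.full(2, 1.0 / 2.0)  # uniform mass over concepts (assumption)

plan = unbalanced_sinkhorn(cost, a, b)
concept_maps = plan / plan.sum(axis=0, keepdims=True)  # soft map per concept
patch_to_concept = plan.argmax(axis=1)                 # hard assignment
```

Because the marginals are only softly enforced, the background patch keeps less mass than the patches that match a concept well, which is the property that makes the unbalanced formulation attractive when not every image region belongs to a named concept.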
If this is right
- Models that pass the causal filter can be paired with MedFocus to produce explanations that clinicians can verify against image content.
- The same counterfactual framework can rank future attribution techniques by how well they recover regions that actually change predictions.
- Multi-level outputs from MedFocus allow users to inspect attributions at the level of whole regions, specific concepts, or individual tokens.
- Improved grounding reduces the risk that an LVLM bases a medical answer on irrelevant image areas.
Where Pith is reading between the lines
- The approach could be extended to other imaging modalities such as CT or MRI by retraining the anatomical concept localizer on new expert annotations.
- Integrating the causal measurement step into model training might encourage LVLMs to rely more consistently on clinically relevant regions.
- The framework offers a template for auditing explanation methods in any high-stakes domain where counterfactual edits are feasible.
- Token-level attributions from MedFocus might help diagnose cases where the model attends to text prompts rather than image content.
Load-bearing premise
Counterfactual editing of the expert-annotated region cleanly isolates its causal contribution to the model's prediction without creating new artifacts or unintended side effects.
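One simple way to probe this premise is to measure how much an edit shifts global image statistics. The sketch below is an illustration under stated assumptions, not the paper's procedure: the image is random noise standing in for a radiograph, the box is an arbitrary rectangle, and the helper names `mask_region` and `global_shift` are hypothetical.

```python
import numpy as np


def mask_region(image, box, fill=None):
    """Return a copy with box=(r0, r1, c0, c1) replaced by a constant.

    fill=None uses the global image mean instead of a fixed value.
    """
    r0, r1, c0, c1 = box
    edited = image.copy()
    edited[r0:r1, c0:c1] = image.mean() if fill is None else fill
    return edited


def global_shift(original, edited):
    """Summarize how far the edit moved simple global statistics."""
    return {
        "mean_diff": float(abs(edited.mean() - original.mean())),
        "std_diff": float(abs(edited.std() - original.std())),
        "changed_fraction": float(np.mean(original != edited)),
    }


# Synthetic stand-in for a chest X-ray (assumption: values in [0, 1)).
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(64, 64))
box = (16, 32, 16, 32)  # hypothetical annotated region

zeroed = global_shift(image, mask_region(image, box, fill=0.0))
mean_filled = global_shift(image, mask_region(image, box))
```

In this toy setting, filling with the global mean moves the mean intensity far less than zeroing does, which shows why the choice of edit can matter for whether a prediction change reflects the region's content or a shift in overall image statistics.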
What would settle it
Apply any attribution method to a held-out CXR-VQA sample, then apply the same counterfactual edit to the region that method highlights rather than to the expert-annotated region. If the model's output changes substantially less than it does under the expert-region edit, or does not change at all, the attribution method is unfaithful on that sample.
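The test described above can be sketched as a ratio of edit effects. Everything here is a toy assumption: `toy_model` stands in for an LVLM's answer score, the boxes are arbitrary rectangles, and zero-filling stands in for whatever counterfactual edit the paper actually uses.

```python
import numpy as np


def edit_effect(model, image, box, fill=0.0):
    """Absolute change in model output after overwriting box=(r0, r1, c0, c1)."""
    r0, r1, c0, c1 = box
    edited = image.copy()
    edited[r0:r1, c0:c1] = fill
    return abs(model(image) - model(edited))


def relative_effect(model, image, attributed_box, expert_box):
    """Effect of editing the attributed region, relative to the expert region."""
    expert = edit_effect(model, image, expert_box)
    if expert == 0.0:
        raise ValueError("expert region has no causal effect on this sample")
    return edit_effect(model, image, attributed_box) / expert


def toy_model(image):
    """Hypothetical model whose score depends only on one 8x8 patch."""
    return float(image[8:16, 8:16].mean())


image = np.ones((32, 32))
expert_box = (8, 16, 8, 16)

faithful = relative_effect(toy_model, image, (8, 16, 8, 16), expert_box)
partial = relative_effect(toy_model, image, (12, 20, 12, 20), expert_box)
unfaithful = relative_effect(toy_model, image, (20, 28, 20, 28), expert_box)
```

A ratio near 1 indicates that the attributed region carries about as much causal weight as the expert region, while a ratio near 0 indicates that the attribution points somewhere the model does not rely on.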
Original abstract
Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a causal evaluation framework for visual attribution in LVLMs on chest X-ray reasoning tasks. It filters CXR-VQA samples to retain only those where expert-annotated regions are verified as causally responsible for model predictions via counterfactual editing, evaluates 11 attribution methods across 6 open-source LVLMs in direct-answer and step-by-step modes, and introduces MedFocus, a concept-based method that localizes anatomical regions via unbalanced optimal transport and measures causal effects through targeted interventions. The central claim is that existing methods often fail to identify the evidence used by LVLMs while MedFocus substantially outperforms them, producing spatial, concept-level, and token-level attributions.
Significance. If the causal framework and comparisons hold, the work provides a more rigorous way to assess whether attribution methods reflect actual model reasoning in medical LVLMs and demonstrates a stronger alternative in MedFocus. This could improve trustworthiness of explanations in clinical applications by emphasizing causal verification over correlational attributions.
major comments (2)
- [causal evaluation framework] Causal evaluation framework (abstract and methods description): The filtering of samples based on counterfactual editing assumes that region edits (e.g., masking or perturbation) cleanly isolate causal effects without altering global image statistics, introducing artifacts, or triggering unrelated behaviors in the LVLM's joint vision-language space. This premise is load-bearing for all downstream comparisons of the 11 methods and MedFocus, yet the manuscript provides no details on edit implementation, checks for unintended global changes, or sensitivity analyses across edit types.
- [results] Results across 11 methods and 6 models: The claim of outperformance lacks reported statistical tests, effect sizes, or confidence intervals on the attribution accuracy metrics, making it difficult to assess whether MedFocus's gains are robust or could be explained by biases in the filtered dataset.
minor comments (2)
- [abstract] The abstract and methods should clarify the exact counterfactual edit procedure (e.g., masking strategy, perturbation strength) and any controls for preserving non-target image properties.
- [related work] Missing references to prior work on counterfactual interventions in vision-language models or limitations of optimal transport in medical imaging contexts.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below with point-by-point responses. Revisions have been made to incorporate additional details and statistical analyses as suggested, strengthening the presentation of the causal framework and results.
- Referee: [causal evaluation framework] Causal evaluation framework (abstract and methods description): The filtering of samples based on counterfactual editing assumes that region edits (e.g., masking or perturbation) cleanly isolate causal effects without altering global image statistics, introducing artifacts, or triggering unrelated behaviors in the LVLM's joint vision-language space. This premise is load-bearing for all downstream comparisons of the 11 methods and MedFocus, yet the manuscript provides no details on edit implementation, checks for unintended global changes, or sensitivity analyses across edit types.
  Authors: We appreciate the referee highlighting the importance of substantiating the assumptions underlying our counterfactual editing procedure. The original manuscript described the high-level approach but provided limited implementation specifics. In the revised version, we have added a dedicated subsection in the Methods that details the edit implementation, including the exact masking (region zeroing with boundary smoothing) and perturbation (Gaussian noise at controlled variance) techniques. We now include quantitative checks for unintended global changes by reporting pre- and post-edit differences in global statistics such as mean pixel intensity, standard deviation, and CLIP feature cosine similarity. Additionally, we present sensitivity analyses across edit types and strengths, demonstrating that the causal verification outcomes remain stable. These revisions directly address the load-bearing premise and enhance the framework's transparency.
  Revision: yes
- Referee: [results] Results across 11 methods and 6 models: The claim of outperformance lacks reported statistical tests, effect sizes, or confidence intervals on the attribution accuracy metrics, making it difficult to assess whether MedFocus's gains are robust or could be explained by biases in the filtered dataset.
  Authors: We acknowledge that the original results would benefit from greater statistical rigor to support the outperformance claims. In the revised manuscript, we have added paired statistical tests (Wilcoxon signed-rank tests with Bonferroni correction) comparing MedFocus against the 11 baseline methods across all six LVLMs and both output modes. We report p-values, effect sizes (Cohen's d), and 95% confidence intervals obtained via bootstrapping for the attribution accuracy metrics. These analyses confirm that the observed gains are statistically significant and consistent, reducing the likelihood that they arise from biases in the filtered dataset, which is constructed uniformly for all methods.
  Revision: yes
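The kinds of statistics named in the response above can be sketched as follows. This is an illustration only: the scores are synthetic and do not come from the paper, the helper names are hypothetical, and the Wilcoxon signed-rank test itself is left out (a library routine such as `scipy.stats.wilcoxon` could supply the p-values that `bonferroni` adjusts).

```python
import numpy as np


def cohens_d_paired(x, y):
    """Paired-samples effect size: mean difference over its standard deviation."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.mean() / diff.std(ddof=1))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * len(p), 1.0)


# Synthetic per-sample attribution scores (NOT results from the paper).
rng = np.random.default_rng(1)
baseline = rng.uniform(0.2, 0.6, size=50)
proposed = baseline + rng.normal(0.1, 0.05, size=50)

effect = cohens_d_paired(proposed, baseline)
ci_low, ci_high = bootstrap_ci(proposed, baseline)
adjusted = bonferroni([0.01, 0.5, 0.04])  # hypothetical raw p-values
```

Pairing matters here because every method is scored on the same filtered samples, so the per-sample differences, not the two marginal score distributions, carry the comparison.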
Circularity Check
No circularity: evaluation uses external expert annotations and interventions
The paper's causal evaluation framework filters CXR-VQA samples based on expert-annotated regions verified via counterfactual editing to confirm causal responsibility for model predictions. This relies on independent external annotations and interventions rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. MedFocus is introduced as a new concept-based method using unbalanced optimal transport for localization and targeted interventions for causal measurement, with performance compared across 11 methods, 6 LVLMs, and two output modes. No derivation step reduces by construction to the paper's own inputs or prior self-citations; the claims rest on external benchmarks and direct comparisons.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Counterfactual editing on CXR images can isolate the causal impact of specific regions on LVLM predictions without introducing confounding changes.
invented entities (1)
- MedFocus: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Quantifying attention flow in transformers
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. U...
-
[2]
In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV)
Suhyun Ahn, Wonjung Park, Jihoon Cho, and Jinah Park. V olumetric conditioning module to control pretrained diffusion models for 3d medical images. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 85–95, 2025. doi: 10.1109/ W ACV61041.2025.00019
-
[3]
Lang, Benedikt Wiestler, Julia A
Malek Ben Alaya, Daniel M. Lang, Benedikt Wiestler, Julia A. Schnabel, and Cosmin I. Bercea. Mededit: Counterfactual diffusion-based image editing on brain mri. In Virginia Fernandez, Jelmer M. Wolterink, David Wiesner, Samuel Remedios, Lianrui Zuo, and Adrià Casamitjana, editors,Simulation and Synthesis in Medical Imaging, pages 167–176, Cham, 2025. Spri...
work page 2025
-
[4]
Bottom-up and top-down attention for image captioning and visual question answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018
work page 2018
-
[5]
AI Magazine , month = mar, pages =
Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation.AI Magazine, 36(1):15–24, Mar. 2015. doi: 10.1609/aimag.v36i1.2564. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2564
-
[6]
Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.PLOS ONE, 10(7):1–46, 07 2015. doi: 10.1371/journal.pone. 0130140. URLhttps://doi.org/10.1371/journal.pone.0130140
-
[7]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. URLhttps://arxiv.org/abs/2308.12966
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,
-
[9]
URLhttps://arxiv.org/abs/2502.13923. 10
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Network dissec- tion: Quantifying interpretability of deep visual representations
David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissec- tion: Quantifying interpretability of deep visual representations. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
work page 2017
-
[11]
SIAM Journal on Scientific Computing , author =
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems.SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015. doi: 10.1137/141000439. URL https: //doi.org/10.1137/141000439
-
[12]
Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, and Ozan Oktay. Making the most of text semantics to improve biomedical vision–language processing. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Fa...
work page 2022
-
[13]
Katarzyna Borys, Yasmin Alyssa Schmitt, Meike Nauta, Christin Seifert, Nicole Krämer, Christoph M. Friedrich, and Felix Nensa. Explainable ai in medical imaging: An overview for clinical practitioners – beyond saliency-based xai approaches.European Journal of Radiology, 162:110786, 2023. ISSN 0720-048X. doi: https://doi.org/10.1016/j.ejrad.2023.110786. UR...
-
[14]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks
Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847,
-
[16]
doi: 10.1109/W ACV .2018.00097
work page doi:10.1109/w 2018
-
[17]
Transformer interpretability beyond attention visualization
Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, June 2021
work page 2021
-
[18]
Chexagent: Towards a foundation model for chest x-ray interpretation
Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S Chaudhari, and Curtis Langlotz. Chexagent: Towards a foundation model for ches...
work page 2024
-
[19]
URLhttps://openreview.net/forum?id=P3LOmrZWGR
-
[20]
Scaling algorithms for unbalanced optimal transport problems, 2018
Lenaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems, 2018. URL https://doi.org/10. 1090/mcom/3303
work page 2018
-
[21]
Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: Dynamic and kantorovich formulations.Journal of Functional Analysis, 274 (11):3090–3123, 2018. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2018.03.008. URL https://www.sciencedirect.com/science/article/pii/S0022123618301058
-
[22]
Sinkhorn distances: Lightspeed computation of optimal transport
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/ af21d0c97db2e27e13572cbf59eb343d-Pape...
work page 2013
-
[23]
Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions?Computer Vision and Image Understanding, 163:90–100, 2017. ISSN 1077-3142. doi: https://doi.org/ 10.1016/j.cviu.2017.10.001. URL https://www.sciencedirect.com/science/article/ pii/S107...
-
[24]
Daniel Coelho de Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas-Serrano, Javier Alvarez-Valle, Joaquín Galant-Herrero, and Antonio Pertusa. Padchest-gr: A bilingual chest x-ray dataset for groun...
-
[25]
Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. InProceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017
work page 2017
-
[26]
Towards automatic concept- based explanations
Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept- based explanations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/ 2019/file/...
work page 2019
-
[27]
Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. Physiobank, physiotoolkit, and physionet.Circulation, 101(23):e215–e220, 2000. doi: 10.1161/01.CIR.101.23.e215. URL https://www.ahajournals.org/doi/abs/10.1161/ 01.CIR.101.23.e215
-
[28]
PathVQA: 30000+ Questions for Medical Visual Question Answering
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering, 2020. URL https://arxiv.org/abs/ 2003.10286
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[29]
Juan Eugenio Iglesias and Mert R. Sabuncu. Multi-atlas segmentation of biomedical images: A survey.Medical Image Analysis, 24(1):205–219, 2015. ISSN 1361-8415. doi: https: //doi.org/10.1016/j.media.2015.06.012. URL https://www.sciencedirect.com/science/ article/pii/S1361841515000997
-
[30]
Comeau, Robert Leaman, Charalampos S
Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, and Zhiyong Lu. Med-v1: Small language models for zero-shot and scalable biomedical evidence attribution, 2026. URL https://arxiv.org/abs/2603. 05308
work page 2026
-
[31]
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019
work page 2019
-
[32]
MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs
Alistair EW Johnson, Tom J Pollard, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs.arXiv preprint arXiv:1901.07042, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1901
-
[33]
Do explanations explain? model knows best
Ashkan Khakzar, Pedram Khorsandi, Rozhin Nobahari, and Nassir Navab. Do explanations explain? model knows best. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10244–10253, June 2022
work page 2022
-
[34]
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCA V). In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machin...
work page 2018
-
[35]
Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedin...
work page 2020
-
[36]
Satyapriya Krishna, Tessa Han, Alex Gu, Steven Wu, Shahin Jabbari, and Himabindu Lakkaraju. The disagreement problem in explainable machine learning: A practitioner’s perspective.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=jESY2WTZCe
work page 2024
-
[37]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9579–9589, June 2024
work page 2024
-
[38]
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018
work page 2018
-
[39]
Llava-med: Training a large language-and-vision assistant for biomedicine in one day
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, ...
work page 2023
-
[40]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 o...
work page 2023
-
[41]
A survey of state of the art large vision language models: Benchmark evaluations and challenges
Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1587–1606, June 2025
work page 2025
-
[42]
arXiv preprint arXiv:2511.19046 (2025)
Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, and Jintai Chen. Medsam3: Delving into segment anything with medical concepts, 2025. URL https://arxiv.org/abs/2511.19046
-
[43]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...
work page 2023
-
[44]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chun- yuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV...
work page 2024
-
[45]
A unified approach to interpreting model predictions
Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 8a20a8621...
work page 2017
-
[46]
Groma: Localized visual tokenization for grounding multimodal large language models
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 417–435, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031- 72658-3
work page 2024
-
[47]
Segment anything in medical images.Nature communications, 15(1):654, 2024
Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images.Nature communications, 15(1):654, 2024
work page 2024
-
[48]
Medsam2: Segment anything in 3d medical images and videos,
Jun Ma, Zongxin Yang, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, and Bo Wang. Medsam2: Segment anything in 3d medical images and videos, 2025. URLhttps://arxiv.org/abs/2504.03600
-
[49]
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016
work page 2016
-
[50]
Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023
Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023
work page 2023
-
[51]
Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations.Scientific Data, 9(1):429, 2022
work page 2022
-
[52]
Capa- bilities of gpt-4 on medical challenge problems, 2023
Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capa- bilities of gpt-4 on medical challenge problems, 2023. URL https://arxiv.org/abs/2303. 13375
work page 2023
-
[53]
Jonggwon Park, Byungmu Yoon, Soobum Kim, and Kyoyun Choi. Radzero: Similarity-based cross-attention for explainable vision-language alignment in chest x-ray with zero-shot multi- task capability. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=WQq5JPGQ0C
work page 2025
-
[54]
Radialog: Large vision-language models for x-ray reporting and dialog-driven as- sistance
Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Benedikt Wiestler, Nassir Navab, and Matthias Keicher. Radialog: Large vision-language models for x-ray reporting and dialog-driven as- sistance. InMedical Imaging with Deep Learning, 2025. URL https://openreview.net/ forum?id=trUvr1gSNI
work page 2025
-
[55]
Grounding multimodal large language models to the world
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Grounding multimodal large language models to the world. InThe Twelfth International Conference on Learning Representations, 2024. URL https://openreview. net/forum?id=lLmqxkfSIw
work page 2024
-
[56]
Sanchez, Boris van Breugel, Daniel C
Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Daniel C. Castro, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay, and Maximilian Ilse. Radedit: Stress-testing biomedical vision models via diffusion image editing. In Aleš Leonar...
work page 2024
-
[57]
RISE: randomized input sampling for explanation of black-box models
Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: randomized input sampling for explanation of black-box models. InBritish Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, page 151. BMV A Press, 2018. URL http://bmvc2018.org/ contents/papers/1064.pdf
work page 2018
-
[58]
Computational optimal transport.Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019
Gabriel Peyré and Marco Cuturi. Computational optimal transport.Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019
work page 2019
-
[59]
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. InProceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015. 14
work page 2015
-
[60]
Rikard Rosenbacke, Åsa Melhus, Martin McKee, and David Stuckler. How explainable artificial intelligence can increase or decrease clinicians’ trust in ai applications in health care: Systematic review.JMIR AI, 3:e53207, Oct 2024. ISSN 2817-1705. doi: 10.2196/53207. URL https://ai.jmir.org/2024/1/e53207
-
[61]
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...
-
[62]
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017
-
[63]
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017. URL https://proceedings. ...
-
[64]
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. URL https://arxiv.org/abs/1312.6034
-
[65]
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
-
[66]
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/sundararajan17a.html
-
[67]
Tim Tanida, Philip Müller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7433–7442, June 2023
-
[68]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...
-
[69]
Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, Anil Palepu, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S. Sara Mahdavi, Bradley Green, Ewa Dominow...
-
[70]
Wei Wang, John A. Ozolek, Dejan Slepčev, Ann B. Lee, Cheng Chen, and Gustavo K. Rohde. An optimal transportation approach for nuclear structure-based pathology. IEEE Transactions on Medical Imaging, 30(3):621–631, 2011. doi: 10.1109/TMI.2010.2089693
-
[71]
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation, 2025. URL https://arxiv.org/abs/2506.18871
-
[72]
Joy T Wu, Nkechinyere Nneka Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, William G Mitchell, Satyananda Kashyap, Andrea Giovannini, Leo Anthony Celi, and Mehdi Moradi. Chest imagenome dataset for clinical reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Be...
-
[73]
Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models, 2025. URL https://arxiv.org/abs/2503.12799
-
[74]
Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, and Huaxiu Yao. Cares: A comprehensive benchmark of trustworthiness in m...
-
[75]
doi: 10.52202/079017-4455. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/fde7f40f8ced5735006810534dc66b33-Paper-Datasets_and_Benchmarks_Track.pdf
-
[76]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research...
-
[77]
Yousef Yeganeh, Azade Farshad, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, and Ehsan Adeli. Latent drifting in diffusion models for counterfactual medical image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7685–7695, June 2025
-
[78]
Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20554–20565. Curran Associates, Inc., 2020. U...
-
[79]
Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2msbbX3ydD
-
[80]
Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1