Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

Aidong Zhang; Guangzhi Xiong; Qiao Jin; Sanchit Sinha; Zhiyong Lu

arXiv: 2605.20158 · v1 · pith:F4BEOGTKnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI · cs.CL

Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang

Pith reviewed 2026-05-20 05:33 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords visual attribution · large vision language models · chest X-ray · causal evaluation · counterfactual editing · optimal transport · medical AI · explainable AI

The pith

Existing visual attribution methods often fail to identify the actual evidence used by LVLMs for chest X-ray reasoning, while MedFocus succeeds by localizing anatomical concepts and measuring their causal effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether common visual attribution techniques truly reflect the visual evidence that large vision language models use when reasoning about chest X-rays. It builds a causal evaluation framework that keeps only those cases where expert-annotated regions can be shown, through counterfactual image edits, to causally drive the model's output. Testing eleven attribution methods on six open-source LVLMs reveals widespread failure to match the evidence the models actually rely on. To fix this, the authors introduce MedFocus, which identifies clinically meaningful anatomical regions with unbalanced optimal transport and then quantifies each region's causal influence via targeted interventions. The result is a method that supplies spatial, concept-level, and token-level attributions and outperforms prior approaches.

Core claim

The central claim is that standard visual attribution methods do not reliably recover the visual evidence underlying LVLM predictions on chest X-rays. A causal evaluation framework filters the dataset to samples where counterfactual editing of expert-annotated regions demonstrably alters model outputs, exposing that most existing methods misalign with these causal regions. MedFocus corrects the mismatch by first localizing anatomical concepts through unbalanced optimal transport and then assessing their causal impact on model outputs with targeted interventions, thereby generating multi-level attributions that are more faithful to the model's internal reasoning.
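
The filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Sample`, `edit_region`, `causal_filter`, and `toy_model` are hypothetical names, the edit is a plain fill of a rectangular box, and "the answer changes after the edit" stands in for whatever verification criterion the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[float]]        # toy grayscale image as nested lists
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


@dataclass
class Sample:
    image: Image
    question: str
    expert_box: Box  # expert-annotated evidence region (assumed rectangular here)


def edit_region(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Hypothetical counterfactual edit: overwrite the boxed region with `fill`."""
    r0, c0, r1, c1 = box
    return [
        [fill if (r0 <= r < r1 and c0 <= c < c1) else px for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def causal_filter(
    samples: List[Sample], model: Callable[[Image, str], str]
) -> List[Sample]:
    """Keep only samples whose answer changes when the expert region is edited."""
    kept = []
    for s in samples:
        before = model(s.image, s.question)
        after = model(edit_region(s.image, s.expert_box), s.question)
        if before != after:
            kept.append(s)
    return kept


def toy_model(image: Image, question: str) -> str:
    """Stand-in for an LVLM: answers 'yes' when the top-left pixel is bright."""
    return "yes" if image[0][0] > 0.5 else "no"
```

With this toy model, a sample whose expert box covers the top-left pixel survives the filter, while a sample whose box lies elsewhere is dropped because editing it leaves the answer unchanged.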

What carries the argument

MedFocus, which localizes clinically meaningful anatomical regions via unbalanced optimal transport and quantifies their causal effects on LVLM outputs through targeted interventions.
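
To make the optimal-transport step concrete, here is a small sketch of entropic unbalanced optimal transport solved with scaling iterations. It is an assumption that MedFocus uses this particular formulation (KL penalties on both marginals with strength `rho`, entropic regularization `eps`); the function name, the cost matrix, and the concept/patch framing are illustrative only.

```python
import math
from typing import List


def unbalanced_sinkhorn(
    a: List[float],
    b: List[float],
    cost: List[List[float]],
    eps: float = 0.1,
    rho: float = 1.0,
    n_iter: int = 200,
) -> List[List[float]]:
    """Entropic unbalanced OT with KL marginal penalties.

    a: source masses (e.g. one entry per concept, hypothetical framing)
    b: target masses (e.g. one entry per image patch)
    cost[i][j]: cost of moving mass from source i to target j
    Returns the transport plan P with P[i][j] = u[i] * K[i][j] * v[j].
    """
    n, m = len(a), len(b)
    # Gibbs kernel of the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    # Exponent < 1 softens the marginal constraints; it tends to 1
    # (balanced Sinkhorn) as rho grows.
    power = rho / (rho + eps)
    for _ in range(n_iter):
        Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        u = [(a[i] / Kv[i]) ** power for i in range(n)]
        Ktu = [sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        v = [(b[j] / Ktu[j]) ** power for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Because the marginals are only softly enforced, the plan may transport less mass than either marginal holds. That relaxation is what distinguishes unbalanced from balanced optimal transport, where every unit of mass must be matched.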

If this is right

Models that pass the causal filter can be paired with MedFocus to produce explanations that clinicians can verify against image content.
The same counterfactual framework can rank future attribution techniques by how well they recover regions that actually change predictions.
Multi-level outputs from MedFocus allow users to inspect attributions at the level of whole regions, specific concepts, or individual tokens.
Improved grounding reduces the risk that an LVLM bases a medical answer on irrelevant image areas.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be extended to other imaging modalities such as CT or MRI by retraining the anatomical concept localizer on new expert annotations.
Integrating the causal measurement step into model training might encourage LVLMs to rely more consistently on clinically relevant regions.
The framework offers a template for auditing explanation methods in any high-stakes domain where counterfactual edits are feasible.
Token-level attributions from MedFocus might help diagnose cases where the model attends to text prompts rather than image content.

Load-bearing premise

Counterfactual editing of the expert-annotated region cleanly isolates its causal contribution to the model's prediction without creating new artifacts or unintended side effects.

What would settle it

Apply any attribution method to a held-out CXR-VQA sample, then perform the same counterfactual edit on the region highlighted by that method instead of the expert region. If the resulting change in model output is absent or substantially smaller than the change produced by editing the expert region, the attribution method is unfaithful on that sample.
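
One way to phrase this test in code is as a ratio of edit effects. The sketch below is hypothetical throughout: `edit_region`, `edit_effect`, `relative_effect`, and `toy_score` are illustrative names, the model is assumed to expose a scalar answer score, and regions are assumed to be rectangular boxes.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def edit_region(image: Image, box: Box, fill: float = 0.0) -> Image:
    """Hypothetical counterfactual edit: overwrite the boxed region with `fill`."""
    r0, c0, r1, c1 = box
    return [
        [fill if (r0 <= r < r1 and c0 <= c < c1) else px for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def edit_effect(
    score: Callable[[Image, str], float], image: Image, question: str, box: Box
) -> float:
    """Absolute change in the model's answer score after editing `box`."""
    return abs(score(image, question) - score(edit_region(image, box), question))


def relative_effect(
    score: Callable[[Image, str], float],
    image: Image,
    question: str,
    attributed_box: Box,
    expert_box: Box,
) -> float:
    """Effect of editing the attributed region, relative to the expert region."""
    expert = edit_effect(score, image, question, expert_box)
    if expert == 0.0:
        raise ValueError("expert edit has no effect; sample fails the causal filter")
    return edit_effect(score, image, question, attributed_box) / expert


def toy_score(image: Image, question: str) -> float:
    """Stand-in for an LVLM answer score: mean brightness of the top-left 2x2 block."""
    block = [image[r][c] for r in range(2) for c in range(2)]
    return sum(block) / len(block)
```

A ratio near 1 means the attributed region matters about as much as the expert region; a ratio near 0 is the failure case described above. Where to draw the threshold is a design choice this sketch leaves open.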

Figures

Figures reproduced from arXiv: 2605.20158 by Aidong Zhang, Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu.

**Figure 2.** Overview of the proposed MedFocus attribution pipeline. Words significantly affected by … [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

**Figure 3.** Reasoning attribution evaluation on MedGround-Bench-Reason. Metrics are averaged … [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 4.** Qualitative comparison on three MedGround-Bench-Direct examples. Ground-truth … [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Token-level concept attribution for a MedGround-Bench-Reason example. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Comparison of MedFocus attributions across models and sample groups. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

**Figure 7.** Qualitative comparison of MedFocus spatial attributions across Gemma3 and MedGemma … [PITH_FULL_IMAGE:figures/full_fig_p026_7.png]

**Figure 8.** Token-level concept attribution for a reasoning example about osteosynthesis material. [PITH_FULL_IMAGE:figures/full_fig_p027_8.png]

Original abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a causal filter for testing attribution methods on chest X-ray LVLMs and shows MedFocus beating baselines on the filtered set, but the edit-based verification step carries the main uncertainty.

This paper's core move is to filter CXR-VQA examples so that only those where editing the expert-annotated region actually changes the LVLM output are kept for evaluation. On that subset they compare eleven attribution methods across six models and two output styles, then introduce MedFocus, which uses unbalanced optimal transport to localize clinically relevant concepts and reports better alignment at spatial, concept, and token levels.

Referee Report

2 major / 2 minor

Summary. The paper develops a causal evaluation framework for visual attribution in LVLMs on chest X-ray reasoning tasks. It filters CXR-VQA samples to retain only those where expert-annotated regions are verified as causally responsible for model predictions via counterfactual editing, evaluates 11 attribution methods across 6 open-source LVLMs in direct-answer and step-by-step modes, and introduces MedFocus, a concept-based method that localizes anatomical regions via unbalanced optimal transport and measures causal effects through targeted interventions. The central claim is that existing methods often fail to identify the evidence used by LVLMs while MedFocus substantially outperforms them, producing spatial, concept-level, and token-level attributions.

Significance. If the causal framework and comparisons hold, the work provides a more rigorous way to assess whether attribution methods reflect actual model reasoning in medical LVLMs and demonstrates a stronger alternative in MedFocus. This could improve trustworthiness of explanations in clinical applications by emphasizing causal verification over correlational attributions.

major comments (2)

[causal evaluation framework] Causal evaluation framework (abstract and methods description): The filtering of samples based on counterfactual editing assumes that region edits (e.g., masking or perturbation) cleanly isolate causal effects without altering global image statistics, introducing artifacts, or triggering unrelated behaviors in the LVLM's joint vision-language space. This premise is load-bearing for all downstream comparisons of the 11 methods and MedFocus, yet the manuscript provides no details on edit implementation, checks for unintended global changes, or sensitivity analyses across edit types.
[results] Results across 11 methods and 6 models: The claim of outperformance lacks reported statistical tests, effect sizes, or confidence intervals on the attribution accuracy metrics, making it difficult to assess whether MedFocus's gains are robust or could be explained by biases in the filtered dataset.

minor comments (2)

[abstract] The abstract and methods should clarify the exact counterfactual edit procedure (e.g., masking strategy, perturbation strength) and any controls for preserving non-target image properties.
[related work] Missing references to prior work on counterfactual interventions in vision-language models or limitations of optimal transport in medical imaging contexts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below with point-by-point responses. Revisions have been made to incorporate additional details and statistical analyses as suggested, strengthening the presentation of the causal framework and results.

Referee: [causal evaluation framework] Causal evaluation framework (abstract and methods description): The filtering of samples based on counterfactual editing assumes that region edits (e.g., masking or perturbation) cleanly isolate causal effects without altering global image statistics, introducing artifacts, or triggering unrelated behaviors in the LVLM's joint vision-language space. This premise is load-bearing for all downstream comparisons of the 11 methods and MedFocus, yet the manuscript provides no details on edit implementation, checks for unintended global changes, or sensitivity analyses across edit types.

Authors: We appreciate the referee highlighting the importance of substantiating the assumptions underlying our counterfactual editing procedure. The original manuscript described the high-level approach but provided limited implementation specifics. In the revised version, we have added a dedicated subsection in the Methods that details the edit implementation, including the exact masking (region zeroing with boundary smoothing) and perturbation (Gaussian noise at controlled variance) techniques. We now include quantitative checks for unintended global changes by reporting pre- and post-edit differences in global statistics such as mean pixel intensity, standard deviation, and CLIP feature cosine similarity. Additionally, we present sensitivity analyses across edit types and strengths, demonstrating that the causal verification outcomes remain stable. These revisions directly address the load-bearing premise and enhance the framework's transparency.
revision: yes
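
The kind of check described in this response can be illustrated with a short sketch. Everything here is an assumption for illustration: `perturb_region`, `edit_report`, and the other names are hypothetical, boundary smoothing is omitted, pixel values are assumed to lie in [0, 1], and flattened pixels stand in for CLIP features, which would require an actual image encoder.

```python
import math
import random
import statistics
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def perturb_region(image: Image, box: Box, sigma: float, seed: int = 0) -> Image:
    """Add Gaussian noise with standard deviation `sigma` inside `box`, clipped to [0, 1]."""
    rng = random.Random(seed)
    r0, c0, r1, c1 = box
    out = [row[:] for row in image]  # copy so the original is left untouched
    for r in range(r0, r1):
        for c in range(c0, c1):
            noisy = out[r][c] + rng.gauss(0.0, sigma)
            out[r][c] = min(1.0, max(0.0, noisy))
    return out


def flatten(image: Image) -> List[float]:
    return [px for row in image for px in row]


def cosine_similarity(x: List[float], y: List[float]) -> float:
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (nx * ny)


def edit_report(
    original: Image,
    edited: Image,
    embed: Callable[[Image], List[float]] = flatten,
) -> Dict[str, float]:
    """Pre/post-edit differences in global statistics and feature similarity."""
    o, e = flatten(original), flatten(edited)
    return {
        "mean_shift": statistics.fmean(e) - statistics.fmean(o),
        "std_shift": statistics.pstdev(e) - statistics.pstdev(o),
        "feature_cosine": cosine_similarity(embed(original), embed(edited)),
    }
```

Small shifts and a cosine close to 1 would suggest the edit stayed local. Passing a real feature extractor as `embed` is what would turn this into the feature-space check mentioned in the response.
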
Referee: [results] Results across 11 methods and 6 models: The claim of outperformance lacks reported statistical tests, effect sizes, or confidence intervals on the attribution accuracy metrics, making it difficult to assess whether MedFocus's gains are robust or could be explained by biases in the filtered dataset.

Authors: We acknowledge that the original results would benefit from greater statistical rigor to support the outperformance claims. In the revised manuscript, we have added paired statistical tests (Wilcoxon signed-rank tests with Bonferroni correction) comparing MedFocus against the 11 baseline methods across all six LVLMs and both output modes. We report p-values, effect sizes (Cohen's d), and 95% confidence intervals obtained via bootstrapping for the attribution accuracy metrics. These analyses confirm that the observed gains are statistically significant and consistent, reducing the likelihood that they arise from biases in the filtered dataset, which is constructed uniformly for all methods.
revision: yes
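
For readers unfamiliar with the analyses named in this response, the sketch below shows one way to compute them on paired per-sample scores. It is illustrative only: the function names are hypothetical, the Wilcoxon test is an exact small-sample version written out for self-containment (in practice a library routine such as `scipy.stats.wilcoxon` would normally be used), Cohen's d is taken in its paired form (mean difference over the standard deviation of differences), and the bootstrap is a simple percentile interval.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Exact two-sided Wilcoxon signed-rank test for paired samples (small n)."""
    d = [p - q for p, q in zip(x, y) if p != q]  # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n  # doubled ranks, so tied (average) ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of ranks i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, di in zip(ranks2, d) if di > 0)
    total2 = sum(ranks2)
    # Null distribution of the positive-rank sum over all 2**n sign assignments.
    counts = {0: 1}
    for r in ranks2:
        nxt = dict(counts)
        for s, c in counts.items():
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    w_min2 = min(w_plus2, total2 - w_plus2)
    tail = sum(c for s, c in counts.items() if s <= w_min2)
    return w_plus2 / 2, min(1.0, 2 * tail / 2 ** n)


def cohens_d_paired(x: Sequence[float], y: Sequence[float]) -> float:
    """Paired effect size: mean difference divided by the SD of differences."""
    d = [p - q for p, q in zip(x, y)]
    return statistics.mean(d) / statistics.stdev(d)


def bootstrap_ci_mean(
    diffs: Sequence[float], n_boot: int = 2000, level: float = 0.95, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means: List[float] = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons
```

One practical point the sketch makes visible: with only six pairs, the smallest two-sided p-value the exact test can produce is 2/64 = 0.03125, which cannot clear a Bonferroni threshold of 0.05/11 for eleven baselines, so the number of evaluated samples matters for this kind of claim.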

Circularity Check

0 steps flagged

No circularity: evaluation uses external expert annotations and interventions

The paper's causal evaluation framework filters CXR-VQA samples based on expert-annotated regions verified via counterfactual editing to confirm causal responsibility for model predictions. This relies on independent external annotations and interventions rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations. MedFocus is introduced as a new concept-based method using unbalanced optimal transport for localization and targeted interventions for causal measurement, with performance compared across 11 methods, 6 LVLMs, and two output modes. No derivation step reduces by construction to the paper's own inputs or prior self-citations; the claims rest on external benchmarks and direct comparisons.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of counterfactual editing for causal verification and the reliability of expert-annotated regions as ground truth; no explicit free parameters or invented entities beyond the proposed method are detailed in the abstract.

axioms (1)

domain assumption: Counterfactual editing on CXR images can isolate the causal impact of specific regions on LVLM predictions without introducing confounding changes.
Invoked in the causal evaluation framework to verify expert-annotated regions (abstract).

invented entities (1)

MedFocus (no independent evidence)
purpose: Concept-based attribution method that localizes anatomical regions via unbalanced optimal transport and measures causal effects.
New method introduced to address failures of prior attribution techniques.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 9 internal anchors

[1]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. U...

work page doi:10.18653/v1/2020.acl-main.385 2020
[2]

In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV)

Suhyun Ahn, Wonjung Park, Jihoon Cho, and Jinah Park. V olumetric conditioning module to control pretrained diffusion models for 3d medical images. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 85–95, 2025. doi: 10.1109/ W ACV61041.2025.00019

work page arXiv 2025
[3]

Lang, Benedikt Wiestler, Julia A

Malek Ben Alaya, Daniel M. Lang, Benedikt Wiestler, Julia A. Schnabel, and Cosmin I. Bercea. Mededit: Counterfactual diffusion-based image editing on brain mri. In Virginia Fernandez, Jelmer M. Wolterink, David Wiesner, Samuel Remedios, Lianrui Zuo, and Adrià Casamitjana, editors,Simulation and Synthesis in Medical Imaging, pages 167–176, Cham, 2025. Spri...

work page 2025
[4]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

work page 2018
[5]

AI Magazine , month = mar, pages =

Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation.AI Magazine, 36(1):15–24, Mar. 2015. doi: 10.1609/aimag.v36i1.2564. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2564

work page doi:10.1609/aimag.v36i1.2564 2015
[6]

PLOS ONE 13, e0203657

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation.PLOS ONE, 10(7):1–46, 07 2015. doi: 10.1371/journal.pone. 0130140. URLhttps://doi.org/10.1371/journal.pone.0130140

work page doi:10.1371/journal.pone 2015
[7]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. URLhttps://arxiv.org/abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Qwen2.5-vl technical report,

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,

work page
[9]

URLhttps://arxiv.org/abs/2502.13923. 10

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Network dissec- tion: Quantifying interpretability of deep visual representations

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissec- tion: Quantifying interpretability of deep visual representations. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

work page 2017
[11]

SIAM Journal on Scientific Computing , author =

Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems.SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015. doi: 10.1137/141000439. URL https: //doi.org/10.1137/141000439

work page doi:10.1137/141000439 2015
[12]

Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, and Ozan Oktay

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, and Ozan Oktay. Making the most of text semantics to improve biomedical vision–language processing. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Fa...

work page 2022
[13]

Borys, Y

Katarzyna Borys, Yasmin Alyssa Schmitt, Meike Nauta, Christin Seifert, Nicole Krämer, Christoph M. Friedrich, and Felix Nensa. Explainable ai in medical imaging: An overview for clinical practitioners – beyond saliency-based xai approaches.European Journal of Radiology, 162:110786, 2023. ISSN 0720-048X. doi: https://doi.org/10.1016/j.ejrad.2023.110786. UR...

work page doi:10.1016/j.ejrad.2023.110786 2023
[14]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks

Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847,

work page
[16]

doi: 10.1109/W ACV .2018.00097

work page doi:10.1109/w 2018
[17]

Transformer interpretability beyond attention visualization

Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, June 2021

work page 2021
[18]

Chexagent: Towards a foundation model for chest x-ray interpretation

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S Chaudhari, and Curtis Langlotz. Chexagent: Towards a foundation model for ches...

work page 2024
[19]

URLhttps://openreview.net/forum?id=P3LOmrZWGR

work page
[20]

Scaling algorithms for unbalanced optimal transport problems, 2018

Lenaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems, 2018. URL https://doi.org/10. 1090/mcom/3303

work page 2018
[21]

Unbalanced optimal transport: Dynamic and kantorovich formulations.Journal of Functional Analysis, 274 (11):3090–3123, 2018

Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: Dynamic and kantorovich formulations.Journal of Functional Analysis, 274 (11):3090–3123, 2018. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2018.03.008. URL https://www.sciencedirect.com/science/article/pii/S0022123618301058

work page doi:10.1016/j.jfa.2018.03.008 2018
[22]

Sinkhorn distances: Lightspeed computation of optimal transport

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/ af21d0c97db2e27e13572cbf59eb343d-Pape...

work page 2013
[23]

Human attention in visual question answering: Do humans and deep networks look at the same regions?Computer Vision and Image Understanding, 163:90–100, 2017

Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions?Computer Vision and Image Understanding, 163:90–100, 2017. ISSN 1077-3142. doi: https://doi.org/ 10.1016/j.cviu.2017.10.001. URL https://www.sciencedirect.com/science/article/ pii/S107...

work page doi:10.1016/j.cviu.2017.10.001 2017
[24]

Daniel Coelho de Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas-Serrano, Javier Alvarez-Valle, Joaquín Galant-Herrero, and Antonio Pertusa. Padchest-gr: A bilingual chest x-ray dataset for groun...

work page doi:10.1056/aidbp2401120 2025
[25]

Fong and Andrea Vedaldi

Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. InProceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017

work page 2017
[26]

Towards automatic concept- based explanations

Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept- based explanations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/ 2019/file/...

work page 2019
[27]

PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals,

Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. Physiobank, physiotoolkit, and physionet.Circulation, 101(23):e215–e220, 2000. doi: 10.1161/01.CIR.101.23.e215. URL https://www.ahajournals.org/doi/abs/10.1161/ 01.CIR.101.23.e215

work page doi:10.1161/01.cir.101.23.e215 2000
[28]

PathVQA: 30000+ Questions for Medical Visual Question Answering

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering, 2020. URL https://arxiv.org/abs/ 2003.10286

work page internal anchor Pith review Pith/arXiv arXiv 2020
[29]

Juan Eugenio Iglesias and Mert R. Sabuncu. Multi-atlas segmentation of biomedical images: A survey.Medical Image Analysis, 24(1):205–219, 2015. ISSN 1361-8415. doi: https: //doi.org/10.1016/j.media.2015.06.012. URL https://www.sciencedirect.com/science/ article/pii/S1361841515000997

work page doi:10.1016/j.media.2015.06.012 2015
[30]

Comeau, Robert Leaman, Charalampos S

Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, and Zhiyong Lu. Med-v1: Small language models for zero-shot and scalable biomedical evidence attribution, 2026. URL https://arxiv.org/abs/2603. 05308

work page 2026
[31]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

work page 2019
[32]

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

Alistair EW Johnson, Tom J Pollard, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs.arXiv preprint arXiv:1901.07042, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[33]

Do explanations explain? model knows best

Ashkan Khakzar, Pedram Khorsandi, Rozhin Nobahari, and Nassir Navab. Do explanations explain? model knows best. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10244–10253, June 2022

work page 2022
[34]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCA V)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCA V). In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machin...

work page 2018
[35]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedin...

work page 2020
[36]

The disagreement problem in explainable machine learning: A practitioner’s perspective.Transactions on Machine Learning Research, 2024

Satyapriya Krishna, Tessa Han, Alex Gu, Steven Wu, Shahin Jabbari, and Himabindu Lakkaraju. The disagreement problem in explainable machine learning: A practitioner’s perspective.Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=jESY2WTZCe

work page 2024
[37]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9579–9589, June 2024

work page 2024
[38]

A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018

work page 2018
[39]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, ...

work page 2023
[40]

BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 o...

work page 2023
[41]

A survey of state of the art large vision language models: Benchmark evaluations and challenges

Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1587–1606, June 2025

work page 2025
[42]

arXiv preprint arXiv:2511.19046 (2025)

Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, and Jintai Chen. Medsam3: Delving into segment anything with medical concepts, 2025. URL https://arxiv.org/abs/2511.19046

work page arXiv 2025
[43]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...

work page 2023
[44]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV...

work page 2024
[45]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621...

work page 2017
[46]

Groma: Localized visual tokenization for grounding multimodal large language models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 417–435, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72658-3

work page 2024
[47]

Segment anything in medical images

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024

work page 2024
[48]

Medsam2: Segment anything in 3d medical images and videos

Jun Ma, Zongxin Yang, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, and Bo Wang. Medsam2: Segment anything in 3d medical images and videos, 2025. URL https://arxiv.org/abs/2504.03600

work page arXiv 2025
[49]

Generation and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

work page 2016
[50]

Foundation models for generalist medical artificial intelligence

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023

work page 2023
[51]

Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations

Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations. Scientific Data, 9(1):429, 2022

work page 2022
[52]

Capabilities of gpt-4 on medical challenge problems

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems, 2023. URL https://arxiv.org/abs/2303.13375

work page 2023
[53]

Radzero: Similarity-based cross-attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability

Jonggwon Park, Byungmu Yoon, Soobum Kim, and Kyoyun Choi. Radzero: Similarity-based cross-attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=WQq5JPGQ0C

work page 2025
[54]

Radialog: Large vision-language models for x-ray reporting and dialog-driven assistance

Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Benedikt Wiestler, Nassir Navab, and Matthias Keicher. Radialog: Large vision-language models for x-ray reporting and dialog-driven assistance. In Medical Imaging with Deep Learning, 2025. URL https://openreview.net/forum?id=trUvr1gSNI

work page 2025
[55]

Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=lLmqxkfSIw

work page 2024
[56]

Radedit: Stress-testing biomedical vision models via diffusion image editing

Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Daniel C. Castro, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay, and Maximilian Ilse. Radedit: Stress-testing biomedical vision models via diffusion image editing. In Aleš Leonar...

work page 2024
[57]

RISE: randomized input sampling for explanation of black-box models

Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: randomized input sampling for explanation of black-box models. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, page 151. BMVA Press, 2018. URL http://bmvc2018.org/contents/papers/1064.pdf

work page 2018
[58]

Computational optimal transport

Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019

work page 2019
[59]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015

work page 2015
[60]

How explainable artificial intelligence can increase or decrease clinicians’ trust in ai applications in health care: Systematic review

Rikard Rosenbacke, Åsa Melhus, Martin McKee, and David Stuckler. How explainable artificial intelligence can increase or decrease clinicians’ trust in ai applications in health care: Systematic review. JMIR AI, 3:e53207, Oct 2024. ISSN 2817-1705. doi: 10.2196/53207. URL https://ai.jmir.org/2024/1/e53207

work page doi:10.2196/53207 2024
[61]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017

work page 2017
[63]

Learning important features through propagating activation differences

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017. URL https://proceedings. ...

work page 2017
[64]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. URL https://arxiv.org/abs/1312.6034

work page internal anchor Pith review Pith/arXiv arXiv 2014
[65]

Large language models encode clinical knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

work page 2023
[66]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/sundararajan17a.html

work page 2017
[67]

Interactive and explainable region-guided radiology report generation

Tim Tanida, Philip Müller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7433–7442, June 2023

work page 2023
[68]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

Towards generalist biomedical AI

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, Anil Palepu, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S. Sara Mahdavi, Bradley Green, Ewa Dominow...

work page doi:10.1056/aioa2300138 2024
[70]

An optimal transportation approach for nuclear structure-based pathology

Wei Wang, John A. Ozolek, Dejan Slepčev, Ann B. Lee, Cheng Chen, and Gustavo K. Rohde. An optimal transportation approach for nuclear structure-based pathology. IEEE Transactions on Medical Imaging, 30(3):621–631, 2011. doi: 10.1109/TMI.2010.2089693

work page doi:10.1109/tmi.2010.2089693 2011
[71]

OmniGen2: Exploration to Advanced Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation, 2025. URL https://arxiv.org/abs/2506.18871

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Chest imagenome dataset for clinical reasoning

Joy T Wu, Nkechinyere Nneka Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, William G Mitchell, Satyananda Kashyap, Andrea Giovannini, Leo Anthony Celi, and Mehdi Moradi. Chest imagenome dataset for clinical reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Be...

work page 2021
[73]

Grounded chain-of-thought for multimodal large language models

Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models, 2025. URL https://arxiv.org/abs/2503.12799

work page arXiv 2025
[74]

Cares: A comprehensive benchmark of trustworthiness in medical vision language models

Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, and Huaxiu Yao. Cares: A comprehensive benchmark of trustworthiness in m...

doi: 10.52202/079017-4455. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/fde7f40f8ced5735006810534dc66b33-Paper-Datasets_and_Benchmarks_Track.pdf

work page doi:10.52202/079017-4455 2024
[76]

Show, attend and tell: Neural image caption generation with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research...

work page 2015
[77]

Latent drifting in diffusion models for counterfactual medical image synthesis

Yousef Yeganeh, Azade Farshad, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, and Ehsan Adeli. Latent drifting in diffusion models for counterfactual medical image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7685–7695, June 2025

work page 2025
[78]

On completeness-aware concept-based explanations in deep neural networks

Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20554–20565. Curran Associates, Inc., 2020. U...

work page 2020
[79]

Ferret: Refer and ground anything anywhere at any granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2msbbX3ydD

work page 2024
[80]

Visualizing and understanding convolutional networks

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1

work page 2014

Showing first 80 references.

[1] [1]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. U...

work page doi:10.18653/v1/2020.acl-main.385 2020

[2] [2]

Volumetric conditioning module to control pretrained diffusion models for 3d medical images

Suhyun Ahn, Wonjung Park, Jihoon Cho, and Jinah Park. Volumetric conditioning module to control pretrained diffusion models for 3d medical images. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 85–95, 2025. doi: 10.1109/WACV61041.2025.00019

work page arXiv 2025

[3] [3]

Mededit: Counterfactual diffusion-based image editing on brain mri

Malek Ben Alaya, Daniel M. Lang, Benedikt Wiestler, Julia A. Schnabel, and Cosmin I. Bercea. Mededit: Counterfactual diffusion-based image editing on brain mri. In Virginia Fernandez, Jelmer M. Wolterink, David Wiesner, Samuel Remedios, Lianrui Zuo, and Adrià Casamitjana, editors, Simulation and Synthesis in Medical Imaging, pages 167–176, Cham, 2025. Spri...

work page 2025

[4] [4]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

work page 2018

[5] [5]

Truth is a lie: Crowd truth and the seven myths of human annotation

Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15–24, Mar. 2015. doi: 10.1609/aimag.v36i1.2564. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2564

work page doi:10.1609/aimag.v36i1.2564 2015

[6] [6]

On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 07 2015. doi: 10.1371/journal.pone.0130140. URL https://doi.org/10.1371/journal.pone.0130140

work page doi:10.1371/journal.pone.0130140 2015

[7] [7]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. URL https://arxiv.org/abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Qwen2.5-vl technical report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Network dissection: Quantifying interpretability of deep visual representations

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

work page 2017

[11] [11]

Iterative bregman projections for regularized transportation problems

Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015. doi: 10.1137/141000439. URL https://doi.org/10.1137/141000439

work page doi:10.1137/141000439 2015

[12] [12]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C. Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, Hoifung Poon, and Ozan Oktay. Making the most of text semantics to improve biomedical vision–language processing. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Fa...

work page 2022

[13] [13]

Explainable ai in medical imaging: An overview for clinical practitioners – beyond saliency-based xai approaches

Katarzyna Borys, Yasmin Alyssa Schmitt, Meike Nauta, Christin Seifert, Nicole Krämer, Christoph M. Friedrich, and Felix Nensa. Explainable ai in medical imaging: An overview for clinical practitioners – beyond saliency-based xai approaches. European Journal of Radiology, 162:110786, 2023. ISSN 0720-048X. doi: https://doi.org/10.1016/j.ejrad.2023.110786. UR...

work page doi:10.1016/j.ejrad.2023.110786 2023

[14] [14]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks

Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018. doi: 10.1109/WACV.2018.00097

work page doi:10.1109/wacv.2018.00097 2018

[17] [17]

Transformer interpretability beyond attention visualization

Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, June 2021

work page 2021

[18] [18]

Chexagent: Towards a foundation model for chest x-ray interpretation

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S Chaudhari, and Curtis Langlotz. Chexagent: Towards a foundation model for ches...

URL https://openreview.net/forum?id=P3LOmrZWGR

work page 2024

[20] [20]

Scaling algorithms for unbalanced optimal transport problems

Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems, 2018. URL https://doi.org/10.1090/mcom/3303

work page 2018

[21] [21]

Unbalanced optimal transport: Dynamic and kantorovich formulations

Lénaïc Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Unbalanced optimal transport: Dynamic and kantorovich formulations. Journal of Functional Analysis, 274(11):3090–3123, 2018. ISSN 0022-1236. doi: https://doi.org/10.1016/j.jfa.2018.03.008. URL https://www.sciencedirect.com/science/article/pii/S0022123618301058

work page doi:10.1016/j.jfa.2018.03.008 2018

[22] [22]

Sinkhorn distances: Lightspeed computation of optimal transport

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/file/af21d0c97db2e27e13572cbf59eb343d-Pape...

work page 2013

[23] [23]

Human attention in visual question answering: Do humans and deep networks look at the same regions?

Abhishek Das, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? Computer Vision and Image Understanding, 163:90–100, 2017. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2017.10.001. URL https://www.sciencedirect.com/science/article/pii/S107...

work page doi:10.1016/j.cviu.2017.10.001 2017

[24] [24]

Daniel Coelho de Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas-Serrano, Javier Alvarez-Valle, Joaquín Galant-Herrero, and Antonio Pertusa. Padchest-gr: A bilingual chest x-ray dataset for groun...

work page doi:10.1056/aidbp2401120 2025

[25] [25]

Interpretable explanations of black boxes by meaningful perturbation

Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017

work page 2017

[26] [26]

Towards automatic concept-based explanations

Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept-based explanations. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/...

work page 2019

[27] [27]

PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals

Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. Physiobank, physiotoolkit, and physionet. Circulation, 101(23):e215–e220, 2000. doi: 10.1161/01.CIR.101.23.e215. URL https://www.ahajournals.org/doi/abs/10.1161/01.CIR.101.23.e215

work page doi:10.1161/01.cir.101.23.e215 2000

[28] [28]

PathVQA: 30000+ Questions for Medical Visual Question Answering

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering, 2020. URL https://arxiv.org/abs/2003.10286

work page internal anchor Pith review Pith/arXiv arXiv 2020

[29] [29]

Juan Eugenio Iglesias and Mert R. Sabuncu. Multi-atlas segmentation of biomedical images: A survey. Medical Image Analysis, 24(1):205–219, 2015. ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2015.06.012. URL https://www.sciencedirect.com/science/article/pii/S1361841515000997

work page doi:10.1016/j.media.2015.06.012 2015

[30] [30]

Med-v1: Small language models for zero-shot and scalable biomedical evidence attribution

Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, and Zhiyong Lu. Med-v1: Small language models for zero-shot and scalable biomedical evidence attribution, 2026. URL https://arxiv.org/abs/2603.05308

work page 2026

[31] [31]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, 2019

work page 2019

[32] [32]

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

Alistair EW Johnson, Tom J Pollard, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019

[33] [33]

Do explanations explain? model knows best

Ashkan Khakzar, Pedram Khorsandi, Rozhin Nobahari, and Nassir Navab. Do explanations explain? model knows best. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10244–10253, June 2022

work page 2022

[34] [34]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machin...

work page 2018

[35] [35]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedin...

work page 2020

[36] [36]

The disagreement problem in explainable machine learning: A practitioner’s perspective

Satyapriya Krishna, Tessa Han, Alex Gu, Steven Wu, Shahin Jabbari, and Himabindu Lakkaraju. The disagreement problem in explainable machine learning: A practitioner’s perspective. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=jESY2WTZCe

work page 2024

[37] [37]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9579–9589, June 2024

work page 2024

[38] [38]

A dataset of clinically generated visual questions and answers about radiology images

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 5(1):180251, 2018

work page 2018

[39] [39]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, ...

work page 2023

[40] [40]

BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 o...

work page 2023

[41] [41]

A survey of state of the art large vision language models: Benchmark evaluations and challenges

Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 1587–1606, June 2025

work page 2025

[42] [42]

arXiv preprint arXiv:2511.19046 (2025)

Anglin Liu, Rundong Xue, Xu R. Cao, Yifan Shen, Yi Lu, Xiang Li, Qianqian Chen, and Jintai Chen. Medsam3: Delving into segment anything with medical concepts, 2025. URL https://arxiv.org/abs/2511.19046

work page arXiv 2025

[43] [43]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ 6dcf277ea32ce3288...

work page 2023

[44] [44]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chun- yuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV...

work page 2024

[45] [45]

A unified approach to interpreting model predictions

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 8a20a8621...

[46]

Groma: Localized visual tokenization for grounding multimodal large language models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, pages 417–435, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72658-3

[47]

Segment anything in medical images

Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024

[48]

Medsam2: Segment anything in 3d medical images and videos

Jun Ma, Zongxin Yang, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, and Bo Wang. Medsam2: Segment anything in 3d medical images and videos, 2025. URL https://arxiv.org/abs/2504.03600

[49]

Generation and comprehension of unambiguous object descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

[50]

Foundation models for generalist medical artificial intelligence

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023

[51]

Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations

Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations. Scientific Data, 9(1):429, 2022

[52]

Capabilities of gpt-4 on medical challenge problems

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems, 2023. URL https://arxiv.org/abs/2303.13375

[53]

Radzero: Similarity-based cross-attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability

Jonggwon Park, Byungmu Yoon, Soobum Kim, and Kyoyun Choi. Radzero: Similarity-based cross-attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=WQq5JPGQ0C

[54]

Radialog: Large vision-language models for x-ray reporting and dialog-driven assistance

Chantal Pellegrini, Ege Özsoy, Benjamin Busam, Benedikt Wiestler, Nassir Navab, and Matthias Keicher. Radialog: Large vision-language models for x-ray reporting and dialog-driven assistance. In Medical Imaging with Deep Learning, 2025. URL https://openreview.net/forum?id=trUvr1gSNI

[55]

Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=lLmqxkfSIw

[56]

Radedit: Stress-testing biomedical vision models via diffusion image editing

Fernando Pérez-García, Sam Bond-Taylor, Pedro P. Sanchez, Boris van Breugel, Daniel C. Castro, Harshita Sharma, Valentina Salvatelli, Maria T. A. Wetscherek, Hannah Richardson, Matthew P. Lungren, Aditya Nori, Javier Alvarez-Valle, Ozan Oktay, and Maximilian Ilse. Radedit: Stress-testing biomedical vision models via diffusion image editing. In Aleš Leonar...

[57]

RISE: randomized input sampling for explanation of black-box models

Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: randomized input sampling for explanation of black-box models. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, page 151. BMVA Press, 2018. URL http://bmvc2018.org/contents/papers/1064.pdf

[58]

Computational optimal transport

Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019

[59]

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

[60]

How explainable artificial intelligence can increase or decrease clinicians’ trust in ai applications in health care: Systematic review

Rikard Rosenbacke, Åsa Melhus, Martin McKee, and David Stuckler. How explainable artificial intelligence can increase or decrease clinicians’ trust in ai applications in health care: Systematic review. JMIR AI, 3:e53207, Oct 2024. ISSN 2817-1705. doi: 10.2196/53207. URL https://ai.jmir.org/2024/1/e53207

[61]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Br...

[62]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017

[63]

Learning important features through propagating activation differences

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017. URL https://proceedings. ...

[64]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014. URL https://arxiv.org/abs/1312.6034

[65]

Large language models encode clinical knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

[66]

Axiomatic attribution for deep networks

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/sundararajan17a.html

[67]

Interactive and explainable region-guided radiology report generation

Tim Tanida, Philip Müller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7433–7442, June 2023

[68]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

[69]

Towards generalist biomedical AI

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, Anil Palepu, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S. Sara Mahdavi, Bradley Green, Ewa Dominow...

doi: 10.1056/aioa2300138

[70]

An optimal transportation approach for nuclear structure-based pathology

Wei Wang, John A. Ozolek, Dejan Slepčev, Ann B. Lee, Cheng Chen, and Gustavo K. Rohde. An optimal transportation approach for nuclear structure-based pathology. IEEE Transactions on Medical Imaging, 30(3):621–631, 2011. doi: 10.1109/TMI.2010.2089693

[71]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation, 2025. URL https://arxiv.org/abs/2506.18871

[72]

Chest imagenome dataset for clinical reasoning

Joy T Wu, Nkechinyere Nneka Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, William G Mitchell, Satyananda Kashyap, Andrea Giovannini, Leo Anthony Celi, and Mehdi Moradi. Chest imagenome dataset for clinical reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Be...

[73]

Grounded chain-of-thought for multimodal large language models

Qiong Wu, Xiangcong Yang, Yiyi Zhou, Chenxin Fang, Baiyang Song, Xiaoshuai Sun, and Rongrong Ji. Grounded chain-of-thought for multimodal large language models, 2025. URL https://arxiv.org/abs/2503.12799

[74]

Cares: A comprehensive benchmark of trustworthiness in medical vision language models

Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, and Huaxiu Yao. Cares: A comprehensive benchmark of trustworthiness in m...

[75]

doi: 10.52202/079017-4455. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/fde7f40f8ced5735006810534dc66b33-Paper-Datasets_and_Benchmarks_Track.pdf

[76]

Show, attend and tell: Neural image caption generation with visual attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research...

[77]

Latent drifting in diffusion models for counterfactual medical image synthesis

Yousef Yeganeh, Azade Farshad, Ioannis Charisiadis, Marta Hasny, Martin Hartenberger, Björn Ommer, Nassir Navab, and Ehsan Adeli. Latent drifting in diffusion models for counterfactual medical image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7685–7695, June 2025

[78]

On completeness-aware concept-based explanations in deep neural networks

Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20554–20565. Curran Associates, Inc., 2020. U...

[79]

Ferret: Refer and ground anything anywhere at any granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2msbbX3ydD

[80]

Visualizing and understanding convolutional networks

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1
