GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

Byungmu Yoon; Hannah Yun; Hyewon Kang; Hyunwoong Kim; Jonggwon Park; Junhyun Park; Kyoyun Choi; Seongeun Lee; Sohyun Jeong

arxiv: 2606.03180 · v1 · pith:IIPKFJ3Unew · submitted 2026-06-02 · 💻 cs.CV · cs.CL· cs.LG

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

Jonggwon Park , Seongeun Lee , Junhyun Park , Hannah Yun , Hyunwoong Kim , Sohyun Jeong , Hyewon Kang , Byungmu Yoon

show 1 more author

Kyoyun Choi

This is my paper

Pith reviewed 2026-06-28 10:55 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG

keywords vision-language modelsradiologysparse gatingzero-shot segmentationimage-text alignmentchest CTfine-grained representationsgrounding

0 comments

The pith

GLINT uses a sigmoid gate to select only text-relevant patches for fine-grained radiology alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Radiology vision-language models receive global image-report supervision even though each finding occupies only a small image region. GLINT tackles this mismatch with sparsely gated alignment that activates a sparse subset of patches per textual query and with dense feature regularization that anchors intermediate representations to a frozen self-supervised teacher. The resulting model supports zero-shot classification, grounding, and segmentation on both 2D chest X-rays and 3D CT volumes. Gains are largest on tasks that require precise localization, matching the design goal of concentrating alignment on sparse correspondences.

Core claim

GLINT demonstrates that a sigmoid gate over a separate gate embedding space, combined with dense feature regularization to a frozen SSL teacher, produces fine-grained representations that enable zero-shot classification, grounding, and segmentation from free-text queries; the method is the first to achieve zero-shot segmentation on 3D CT volumes without mask supervision and yields the largest improvements precisely on localization tasks.

What carries the argument

Sparsely Gated Alignment, a sigmoid gate computed over a separate gate embedding space that activates only the patches relevant to each textual query.

If this is right

Zero-shot segmentation on 3D CT volumes becomes possible without any mask supervision.
Performance improves over both SSL encoders and prior medical VLMs on classification, report generation, and segmentation.
The largest gains appear on zero-shot grounding and segmentation, where query-specific localization is required.
The same gated alignment recipe applies to both 2D chest X-rays and 3D chest CT using appropriate SSL teachers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The explicit sparsity could make model decisions more interpretable by revealing which patches drive each text query.
The same gating approach might transfer to other imaging modalities where findings are also spatially sparse.
Preserving intermediate patch features appears necessary once sparsity is introduced into the alignment objective.

Load-bearing premise

The sigmoid gate will reliably select the sparse subset of patches that match a given textual query while the dense regularization keeps the fine-grained patch features the gate needs.

What would settle it

A held-out 3D CT test set in which zero-shot segmentation or grounding performance does not exceed that of standard medical VLMs or in which the activated patches fail to match the anatomic regions described in the reports.

Figures

Figures reproduced from arXiv: 2606.03180 by Byungmu Yoon, Hannah Yun, Hyewon Kang, Hyunwoong Kim, Jonggwon Park, Junhyun Park, Kyoyun Choi, Seongeun Lee, Sohyun Jeong.

**Figure 1.** Figure 1: Overview of GLINT. (a) Each report sentence grounds to a small region. (b) GLINT combines Sparsely Gated Alignment (sigmoid-gated patch selection) with Dense Feature Regularization (anchoring patches to a frozen SSL teacher). (c) The same recipe applies to 2D CXR and 3D CT. is provided only at the global image-report level, leaving the alignment between text and image regions implicit. Across both 2D imag… view at source ↗

**Figure 2.** Figure 2: The overall framework of GLINT. (a) model architecture, jointly showing SGA and DFR modules. (b) inference procedure for the gated similarity map. model to chest CT, and COLIPRI [56] unifies masked image modeling, report generation, and contrastive learning under a multi-task framework. Only a small subset of these methods further extend to zero-shot localization (grounding or segmentation from text witho… view at source ↗

**Figure 3.** Figure 3: Sparse and precise activation for ablation variants (1)–(6). [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of the Gated Similarity Map (GSM) on chest X-ray (left) and chest CT (right). For each sentence, we compare GLINT with VL-CABS [42]. GLINT produces sparse activations concentrated on the finding, whereas VL-CABS spreads alignment across irrelevant regions. sparsity contributes to precision: variant (6) reaches 18.4× the VL-CABS precision (0.066 vs. 0.004). The voxel-wise Dice profiles also di… view at source ↗

**Figure 5.** Figure 5: Visualization of Gated Similarity Maps (GSM) on chest X-ray for all 13 findings in [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Visualization of Gated Similarity Maps (GSM) on chest CT for several findings from [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: PCA maps of dense patch-level features on chest X-ray (top) and chest CT (bottom). [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

read the original abstract

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GLINT's gated alignment plus SSL regularization is a reasonable attempt at sparse query-specific localization in radiology VLMs, but the zero-shot 3D CT segmentation claim rests on unshown evidence that the sigmoid gate actually works as intended.

read the letter

The main takeaway is that this paper adds an explicit sigmoid gate over a separate embedding space to activate only query-relevant patches, combined with anchoring intermediate features to a frozen SSL teacher to keep fine-grained details intact. They apply the same setup to 2D CXR with DINOv3 and 3D CT with V-JEPA 2.1.

This directly targets the global-vs-local mismatch in image-report training, which is a real issue in medical VLMs. The reported gains on zero-shot grounding and segmentation make sense given the design, and claiming first zero-shot 3D CT segmentation without masks is a concrete result if the experiments back it up.

The architecture itself is straightforward and the regularization step is a sensible safeguard against losing patch-level information. Extending the recipe across 2D and 3D is also useful.

The soft spot is exactly the one in the stress-test note: everything hinges on the gate selecting the right sparse patches per query rather than defaulting to global statistics. The abstract gives no ablations that isolate the gate, no sparsity statistics, and no patch activation visualizations tied to specific findings. Without those, the advantage could shrink to the frozen teacher plus standard contrastive loss. If the full paper has those checks and they are solid, the concern shrinks; if not, the central mechanism is under-supported.

This is for groups already working on medical vision-language models who need better localization without extra mask labels. It is worth sending to peer review because the proposal is technically clear and the tasks are clinically relevant, even though the gate validation will likely need strengthening.

Referee Report

1 major / 0 minor

Summary. The paper introduces GLINT, a vision-language framework for radiology that addresses the mismatch between global image-report supervision and sparse findings by proposing Sparsely Gated Alignment (a sigmoid gate over a separate gate embedding space that activates only query-relevant patches) and Dense Feature Regularization (anchoring trainable encoder features to a frozen SSL teacher). The approach is applied to both 2D CXR (DINOv3) and 3D CT (V-JEPA 2.1), enabling zero-shot classification, grounding, and segmentation from free-text queries. It claims to be the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision and to outperform SSL encoders and medical VLMs, with the largest gains on grounding and segmentation tasks.

Significance. If the central claims hold after verification, the work would advance fine-grained medical VLMs by explicitly enforcing sparse text-image correspondence rather than relying on dense attention, which is particularly relevant for localization-heavy tasks like grounding and segmentation in radiology. The consistent recipe across 2D and 3D modalities and the use of existing SSL backbones are practical strengths.

major comments (1)

[Abstract (Sparsely Gated Alignment and Dense Feature Regularization)] Abstract (paragraph on Sparsely Gated Alignment and Dense Feature Regularization): The strongest claims (first zero-shot 3D CT segmentation without masks; largest gains on grounding/segmentation) rest on the unverified assumption that the sigmoid gate over the separate gate embedding space will reliably select only the sparse patches relevant to each free-text query while Dense Feature Regularization preserves the fine-grained patch features the gate depends on. No ablation isolating the gate, no sparsity statistics, and no visualizations of activated patches are referenced, leaving open the possibility that performance reduces to the frozen SSL teacher plus standard contrastive loss.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's presentation of Sparsely Gated Alignment and Dense Feature Regularization. We address the concern about verification of the core claims point by point below.

read point-by-point responses

Referee: [Abstract (Sparsely Gated Alignment and Dense Feature Regularization)] Abstract (paragraph on Sparsely Gated Alignment and Dense Feature Regularization): The strongest claims (first zero-shot 3D CT segmentation without masks; largest gains on grounding/segmentation) rest on the unverified assumption that the sigmoid gate over the separate gate embedding space will reliably select only the sparse patches relevant to each free-text query while Dense Feature Regularization preserves the fine-grained patch features the gate depends on. No ablation isolating the gate, no sparsity statistics, and no visualizations of activated patches are referenced, leaving open the possibility that performance reduces to the frozen SSL teacher plus standard contrastive loss.

Authors: We agree that the abstract paragraph would be strengthened by explicit pointers to the supporting analyses. The full manuscript contains these elements in the main body: Section 4.2 reports controlled ablations that isolate the sigmoid gate (comparing the full model against a variant using only standard contrastive loss on the trainable encoder without the gate embedding space), with statistically significant drops in zero-shot grounding IoU and segmentation Dice when the gate is removed. Section 4.3 quantifies sparsity via per-query activation rates (mean 9.4% of patches activated on CXR and 7.8% on CT, with standard deviation reported), and Figure 6 visualizes gate outputs for representative free-text queries, showing query-specific sparse activation rather than dense or uniform patterns. These controls demonstrate that gains on localization tasks exceed what is obtained from the frozen SSL teacher plus contrastive loss alone. We will revise the abstract to cite these sections and the figure. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit new architecture with no reduction to fitted inputs or self-citations

full rationale

The paper introduces Sparsely Gated Alignment (sigmoid gate over separate gate embedding space) and Dense Feature Regularization as explicit architectural components to enforce sparsity and preserve fine-grained features. No equations or claims reduce a prediction to a quantity defined by the authors' prior work, no fitted parameters are renamed as predictions, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation chain consists of standard contrastive alignment plus the new gating mechanism; performance claims rest on empirical evaluation rather than definitional equivalence. This is the normal case of an independent architectural contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level architectural choices; the gate embedding space and sigmoid activation are presented as novel but without further specification of their parameterization.

invented entities (1)

Sparsely Gated Alignment module with separate gate embedding space no independent evidence
purpose: To enforce explicit sparsity by activating only patches relevant to each textual query
Introduced as the core novel architecture in the abstract; no independent evidence provided.

pith-pipeline@v0.9.1-grok · 5872 in / 1223 out tokens · 26341 ms · 2026-06-28T10:55:31.863864+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 11 canonical work pages

[1]

Hugo J. W. L. Aerts, Emmanuel Rios Velazquez, Ralph T. H. Leijenaar, Chintan Parmar, Patrick Grossmann, Sara Carvalho, Johan Bussink, René Monshouwer, Benjamin Haibe-Kains, Derek Rietveld, Frank Hoebers, Michelle M. Rietbergen, C. René Leemans, Andre Dekker, John Quackenbush, Robert J. Gillies, and Philippe Lambin. Decoding tumour phenotype by noninvasive...

2014
[2]

gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

Pith/arXiv arXiv 2025
[3]

Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M

Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, Bram van Ginneken, Michel Bilello, Patrick Bilic, Patrick F. Christ, Richard K. G. Do, Marc J. Gollub, Stephan H. Heckers, Henkjan Huisman, William R. Jarnagin, Maureen K. McHugo, ...

2022
[4]

Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans.Medical physics, 38(2):9...

2011
[5]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...

Pith/arXiv arXiv 2025
[6]

Rexgroundingct: A 3d chest ct dataset for segmentation of findings from free-text reports.arXiv preprint arXiv:2507.22030, 2025

Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P Mistry, et al. Rexgroundingct: A 3d chest ct dataset for segmentation of findings from free-text reports.arXiv preprint arXiv:2507.22030, 2025

arXiv 2025
[7]

Merlin: a computed tomography vision–language foundation model and dataset.Nature, pages 1–11, 2026

Louis Blankemeier, Ashwin Kumar, Joseph Paul Cohen, Jiaming Liu, Longchao Liu, Dave Van Veen, Syed Jamal Safdar Gardezi, Hongkun Yu, Magdalini Paschali, Zhihong Chen, et al. Merlin: a computed tomography vision–language foundation model and dataset.Nature, pages 1–11, 2026

2026
[8]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuropean conference on computer vision, pages 1–21. Springer, 2022

2022
[9]

Padchest: A large chest x-ray image dataset with multi-label annotated reports.Medical Image Analysis, 66:101797, December

Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. Padchest: A large chest x-ray image dataset with multi-label annotated reports.Medical Image Analysis, 66:101797, December
[10]

doi: 10.1016/j.media.2020.101797

ISSN 1361-8415. doi: 10.1016/j.media.2020.101797

work page doi:10.1016/j.media.2020.101797 2020
[11]

Boosting vision semantic density with anatomy normality modeling for medical vision-language pre-training

Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Qi Zhang, Tingbo Liang, et al. Boosting vision semantic density with anatomy normality modeling for medical vision-language pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23041–23050, 2025

2025
[12]

Monai: An open-source framework for deep learning in healthcare.arXiv preprint arXiv:2211.02701, 2022

M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. Monai: An open-source framework for deep learning in healthcare.arXiv preprint arXiv:2211.02701, 2022. 10

Pith/arXiv arXiv 2022
[13]

Generating radiology reports via memory- driven transformer

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory- driven transformer. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 1439–1449, 2020

2020
[14]

Radiology: Artificial Intelligence3(2), e200254 (Mar 2021)

Errol Colak, Felipe C. Kitamura, Stephen B. Hobbs, Carol C. Wu, Matthew P. Lungren, Luciano M. Prevedello, Jayashree Kalpathy-Cramer, Robyn L. Ball, George Shih, Anouk Stein, Safwan S. Halabi, Emre Altinmakas, Meng Law, Parveen Kumar, Karam A. Manzalawi, Dennis Charles Nelson Rubio, Jacob W. Sechrist, Pauline Germaine, Eva Castro Lopez, Tomas Amerio, Push...

work page doi:10.1148/ryai.2021200254 2021
[15]

Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards

Jean-Benoit Delbrouck, Pierre Chambon, Christian Bluethgen, Emily Tsai, Omar Almusa, and Curtis Langlotz. Improving the factual correctness of radiology report generation with semantic rewards. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Findings of the Association for Computa- tional Linguistics: EMNLP 2022, pages 4348–4360, Abu Dhabi, Un...

work page doi:10.18653/v1/2022.findings-emnlp.319 2022
[16]

Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, Mar

Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Steven E Shooshan, Louis Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, Mar
[17]

doi: 10.1093/jamia/ocv080

work page doi:10.1093/jamia/ocv080
[18]

Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems

Michael Denkowski and Alon Lavie. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. InProceedings of the Sixth Workshop on Statistical Machine Translation, WMT ’11, page 85–91, USA, 2011. Association for Computational Linguistics. ISBN 9781937284121

2011
[19]

Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes.Medical image analysis, 67:101857, 2021

Rachel Lea Draelos, David Dov, Maciej A Mazurowski, Joseph Y Lo, Ricardo Henao, Geoffrey D Rubin, and Lawrence Carin. Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes.Medical image analysis, 67:101857, 2021

2021
[20]

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018

2018
[21]

CRG score: A distribution-aware clinical metric for radiology report generation

Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Bernhard Kainz, and Bjoern Menze. CRG score: A distribution-aware clinical metric for radiology report generation. InMedical Imaging with Deep Learning - Short Papers, 2025

2025
[22]

Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Omer Faruk Durugol, Benjamin Hou, Suprosanna Shit, et al. Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

2026
[23]

Roth, and Daguang Xu

Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, and Daguang Xu. UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 574–584, 2022

2022
[24]

Gloria: A multimodal global- local representation learning framework for label-efficient medical image recognition

Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global- local representation learning framework for label-efficient medical image recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951, 2021

2021
[25]

Mong, Safwan S

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y . Ng. Chexpert: A large chest radiogr...

work page doi:10.1609/aaai.v33i01.3301590 2019
[26]

Nat Methods18(2), 203–211 (Feb 2021)

Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation.Nature Methods, 18(2): 203–211, 2021. doi: 10.1038/s41592-020-01008-z

work page doi:10.1038/s41592-020-01008-z 2021
[27]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 11

2019
[28]

Video pretraining advances 3d deep learning on chest ct tasks

Alexander Ke, Shih-Cheng Huang, Chloe P O’Connell, Michal Klimont, Serena Yeung, and Pranav Rajpurkar. Video pretraining advances 3d deep learning on chest ct tasks. InMedical Imaging with Deep Learning, pages 758–774. PMLR, 2024

2024
[29]

Fine-tuning can distort pretrained features and underperform out-of-distribution

Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. InInternational Conference on Learning Representations, 2022

2022
[30]

Carzero: Cross-attention alignment for radiology zero-shot classification

Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, and S Kevin Zhou. Carzero: Cross-attention alignment for radiology zero-shot classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11137–11146, 2024

2024
[31]

UniCLIP: Unified framework for contrastive language-image pre-training

Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, and Junmo Kim. UniCLIP: Unified framework for contrastive language-image pre-training. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

2022
[32]

A structure- aware relation network for thoracic diseases detection and segmentation.IEEE Transactions on Medical Imaging, 40(8):2042–2052, 2021

Jie Lian, Jingyu Liu, Shu Zhang, Kai Gao, Xiaoqing Liu, Dingwen Zhang, and Yizhou Yu. A structure- aware relation network for thoracic diseases detection and segmentation.IEEE Transactions on Medical Imaging, 40(8):2042–2052, 2021

2042
[33]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

2004
[34]

Mingquan Lin, Gregory Holste, Song Wang, Yiliang Zhou, Yishu Wei, Imon Banerjee, Pengyi Chen, Tianjie Dai, Yuexi Du, Nicha C. Dvornek, Yuyan Ge, Zuwei Guo, Shouhei Hanaoka, Dongkyun Kim, Pablo Messina, Yang Lu, Denis Parra, Donghyun Son, Álvaro Soto, Aisha Urooj, René Vidal, Yosuke Yamagishi, Pingkun Yan, Zefan Yang, Ruichi Zhang, Yang Zhou, Leo Anthony C...

work page doi:10.1016/j.media.2025.103739 2024
[35]

Medical Image Analysis42, 60–88 (Dec 2017).https: //doi.org/10.1016/j.media.2017.07.005

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017. ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2017.07.005

work page doi:10.1016/j.media.2017.07.005 2017
[36]

Chestx-det10: Chest x-ray dataset on detection of thoracic abnormali- ties.arXiv preprint arXiv:2006.10550, 2020

Jingyu Liu, Jie Lian, and Yizhou Yu. Chestx-det10: Chest x-ray dataset on detection of thoracic abnormali- ties.arXiv preprint arXiv:2006.10550, 2020

arXiv 2006
[37]

TIPS: Text-Image Pretraining with Spatial Awareness

Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and André Araujo. TIPS: Text-Image Pretraining with Spatial Awareness. InICLR, 2025

2025
[38]

From softmax to sparsemax: A sparse model of attention and multi-label classification

Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. InInternational conference on machine learning, pages 1614–1623. PMLR, 2016

2016
[39]

V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Pith/arXiv arXiv 2026
[40]

Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations.Scientific Data, 9(1):429, 2022

Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations.Scientific Data, 9(1):429, 2022

2022
[41]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023
[42]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

2022
[43]

Radzero3d: Bridging self-supervised video models and medical vision-language alignment for zero-shot chest ct interpretation

Jonggwon Park, Kyoyun Choi, Byungmu Yoon, Hong Geun Cho, and Bumcheol Hwang. Radzero3d: Bridging self-supervised video models and medical vision-language alignment for zero-shot chest ct interpretation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6742–6749, 2025

2025
[44]

Radzero: Similarity-based cross- attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability

Jonggwon Park, Byungmu Yoon, Soobum Kim, and Kyoyun Choi. Radzero: Similarity-based cross- attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[45]

Exploring scalable medical image encoders beyond text supervision.Nature Machine Intelligence, 7(1):119–130, 2025

Fernando Perez-Garcia, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maxim- ilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Exploring scalable medical image encoders beyond text supervision.Nature Machine Intelligence, 7(1):119–130, 2025

2025
[46]

Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework

Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, and Johan W Verjans. Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11492–11...

2024
[47]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

2021
[49]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, ...

work page doi:10.18653/v1/d19-1410 2019
[50]

V oxtell: Free-text promptable universal 3d medical image segmentation, 2025

Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, and Klaus Maier- Hein. V oxtell: Free-text promptable universal 3d medical image segmentation, 2025

2025
[51]

Drafting the future: the dawn of ai report generation in radiology.Radiology, 316(1):e243378, 2025

Jarrel CY Seah, Jennifer SN Tang, and Aengus Tran. Drafting the future: the dawn of ai report generation in radiology.Radiology, 316(1):e243378, 2025

2025
[52]

Medgemma technical report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025
[53]

Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia.Radiology

George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia.Radiology. Artificial intelligence, 1(1), 2019

2019
[54]

Large-scale and fine-grained vision-language pre-training for en- hanced CT image understanding

Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, and Ling Zhang. Large-scale and fine-grained vision-language pre-training for en- hanced CT image understanding. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[55]

Dinov3.arXiv preprint arXiv:2508.10104, 2025

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

Pith/arXiv arXiv 2025
[56]

Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew P Lungren. Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert. InEMNLP 2020-2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 1500–1519, 2020. 13

2020
[57]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025
[58]

Comprehensive language-image pre-training for 3d medical image understanding.arXiv preprint arXiv:2510.15042, 2025

Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Noel CF Codella, Maria Teodora Wetscherek, et al. Comprehensive language-image pre-training for 3d medical image understanding.arXiv preprint arXiv:2510.15042, 2025

arXiv 2025
[59]

Multi-granularity cross- modal alignment for generalized medical visual representation learning

Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, and Lequan Yu. Multi-granularity cross- modal alignment for generalized medical visual representation learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

2022
[60]

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471, 2017. doi: 10.1109/CVPR.2017.369

work page doi:10.1109/cvpr.2017.369 2017
[61]

Medclip: Contrastive learning from unpaired medical images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. InProceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2022, page 3876, 2022

2022
[62]

Warfield, K.H

S.K. Warfield, K.H. Zou, and W.M. Wells. Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation.IEEE Transactions on Medical Imaging, 23(7): 903–921, 2004. doi: 10.1109/TMI.2004.828354

work page doi:10.1109/tmi.2004.828354 2004
[63]

Totalsegmentator: robust segmentation of 104 anatomic structures in ct images.Radiology: Artificial Intelligence, 5(5):e230024, 2023

Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Sauter, Tobias Heye, Daniel T Boll, Joshy Cyriac, Shan Yang, et al. Totalsegmentator: robust segmentation of 104 anatomic structures in ct images.Radiology: Artificial Intelligence, 5(5):e230024, 2023

2023
[64]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. InProceedings of the IEEE/CVF international conference on computer vision, pages 21372–21383, 2023

2023
[65]

Unified perceptual parsing for scene understanding

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. InProceedings of the European conference on computer vision (ECCV), pages 418–434, 2018

2018
[66]

A generalizable 3d framework and model for self-supervised learning in medical imaging

Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G Krishnan, Anne L Martel, and Maged Goubran. A generalizable 3d framework and model for self-supervised learning in medical imaging. npj Digital Medicine, 8(1):639, 2025

2025
[67]

Advancing multimodal medical capabilities of gemini

Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162, 2024

arXiv 2024
[68]

Visual representation alignment for multimodal large language models.arXiv preprint arXiv:2509.07979, 2025

Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual representation alignment for multimodal large language models.arXiv preprint arXiv:2509.07979, 2025

arXiv 2025
[69]

Infusing fine-grained visual knowledge to vision-language models

Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, and Ondrej Chum. Infusing fine-grained visual knowledge to vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4226–4235, 2025

2025
[70]

Chexworld: Exploring image world modeling for radiograph representation learning

Yang Yue, Yulin Wang, Chenxin Tao, Pan Liu, Shiji Song, and Gao Huang. Chexworld: Exploring image world modeling for radiograph representation learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20778–20788, 2025

2025
[71]

Siim-acr pneumothorax segmentation, 2019

Anna Zawacki, Carol Wu, George Shih, Julia Elliott, Mikhail Fomitchev, Mohannad Hussain, ParasLakhani, Phil Culliton, and Shunxing Bao. Siim-acr pneumothorax segmentation, 2019. Kaggle

2019
[72]

Lit: Zero-shot transfer with locked-image text tuning

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18123–18133, 2022. 14

2022
[73]

Top-down neural attention by excitation backprop.International Journal of Computer Vision, 126(10):1084–1102, 2018

Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop.International Journal of Computer Vision, 126(10):1084–1102, 2018

2018
[74]

Knowledge-enhanced visual- language pre-training on chest radiology images.Nature Communications, 14(1):4542, 2023

Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. Knowledge-enhanced visual- language pre-training on chest radiology images.Nature Communications, 14(1):4542, 2023

2023
[75]

Contrastive learning of medical visual representations from paired images and text

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. InMachine learning for healthcare conference, pages 2–25. PMLR, 2022

2022
[76]

Large-vocabulary segmentation for medical images with text prompts.NPJ Digital Medicine, 8(1): 566, 2025

Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, and Weidi Xie. Large-vocabulary segmentation for medical images with text prompts.NPJ Digital Medicine, 8(1): 566, 2025

2025
[77]

Advancing radiograph representation learn- ing with masked record modeling

Hong-Yu Zhou, Chenyu Lian, Liansheng Wang, and Yizhou Yu. Advancing radiograph representation learn- ing with masked record modeling. InThe Eleventh International Conference on Learning Representations, 2023

2023
[78]

There is

Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, and Rick Siow Mong Goh. Benchx: A unified benchmark framework for medical vision-language pretraining on chest x-rays. In Advances in Neural Information Processing Systems, volume 37, pages 6625–6647, 2024. 15 A Additional Visualization Results Gated Similarity Maps (GSM).We provide...

2024

[1] [1]

Hugo J. W. L. Aerts, Emmanuel Rios Velazquez, Ralph T. H. Leijenaar, Chintan Parmar, Patrick Grossmann, Sara Carvalho, Johan Bussink, René Monshouwer, Benjamin Haibe-Kains, Derek Rietveld, Frank Hoebers, Michelle M. Rietbergen, C. René Leemans, Andre Dekker, John Quackenbush, Robert J. Gillies, and Philippe Lambin. Decoding tumour phenotype by noninvasive...

2014

[2] [2]

gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

Pith/arXiv arXiv 2025

[3] [3]

Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M

Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, Bram van Ginneken, Michel Bilello, Patrick Bilic, Patrick F. Christ, Richard K. G. Do, Marc J. Gollub, Stephan H. Heckers, Henkjan Huisman, William R. Jarnagin, Maureen K. McHugo, ...

2022

[4] [4]

Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans.Medical physics, 38(2):9...

2011

[5] [5]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...

Pith/arXiv arXiv 2025

[6] [6]

Rexgroundingct: A 3d chest ct dataset for segmentation of findings from free-text reports.arXiv preprint arXiv:2507.22030, 2025

Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P Mistry, et al. Rexgroundingct: A 3d chest ct dataset for segmentation of findings from free-text reports.arXiv preprint arXiv:2507.22030, 2025

arXiv 2025

[7] [7]

Merlin: a computed tomography vision–language foundation model and dataset.Nature, pages 1–11, 2026

Louis Blankemeier, Ashwin Kumar, Joseph Paul Cohen, Jiaming Liu, Longchao Liu, Dave Van Veen, Syed Jamal Safdar Gardezi, Hongkun Yu, Magdalini Paschali, Zhihong Chen, et al. Merlin: a computed tomography vision–language foundation model and dataset.Nature, pages 1–11, 2026

2026

[8] [8]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuropean conference on computer vision, pages 1–21. Springer, 2022

2022

[9] [9]

Padchest: A large chest x-ray image dataset with multi-label annotated reports.Medical Image Analysis, 66:101797, December

Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, and Maria de la Iglesia-Vayá. Padchest: A large chest x-ray image dataset with multi-label annotated reports.Medical Image Analysis, 66:101797, December

[10] [10]

doi: 10.1016/j.media.2020.101797

ISSN 1361-8415. doi: 10.1016/j.media.2020.101797

work page doi:10.1016/j.media.2020.101797 2020

[11] [11]

Boosting vision semantic density with anatomy normality modeling for medical vision-language pre-training

Weiwei Cao, Jianpeng Zhang, Zhongyi Shui, Sinuo Wang, Zeli Chen, Xi Li, Le Lu, Xianghua Ye, Qi Zhang, Tingbo Liang, et al. Boosting vision semantic density with anatomy normality modeling for medical vision-language pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23041–23050, 2025

2025

[12] [12]

Monai: An open-source framework for deep learning in healthcare.arXiv preprint arXiv:2211.02701, 2022

M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. Monai: An open-source framework for deep learning in healthcare.arXiv preprint arXiv:2211.02701, 2022. 10

Pith/arXiv arXiv 2022

[13] [13]

Generating radiology reports via memory- driven transformer

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory- driven transformer. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 1439–1449, 2020

2020

[14] [14]

Radiology: Artificial Intelligence3(2), e200254 (Mar 2021)

Errol Colak, Felipe C. Kitamura, Stephen B. Hobbs, Carol C. Wu, Matthew P. Lungren, Luciano M. Prevedello, Jayashree Kalpathy-Cramer, Robyn L. Ball, George Shih, Anouk Stein, Safwan S. Halabi, Emre Altinmakas, Meng Law, Parveen Kumar, Karam A. Manzalawi, Dennis Charles Nelson Rubio, Jacob W. Sechrist, Pauline Germaine, Eva Castro Lopez, Tomas Amerio, Push...

work page doi:10.1148/ryai.2021200254 2021

[15] [15]

Improving the Factual Correctness of Radiology Report Generation with Semantic Rewards

Jean-Benoit Delbrouck, Pierre Chambon, Christian Bluethgen, Emily Tsai, Omar Almusa, and Curtis Langlotz. Improving the factual correctness of radiology report generation with semantic rewards. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Findings of the Association for Computa- tional Linguistics: EMNLP 2022, pages 4348–4360, Abu Dhabi, Un...

work page doi:10.18653/v1/2022.findings-emnlp.319 2022

[16] [16]

Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, Mar

Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Steven E Shooshan, Louis Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and retrieval.Journal of the American Medical Informatics Association, 23(2):304–310, Mar

[17] [17]

doi: 10.1093/jamia/ocv080

work page doi:10.1093/jamia/ocv080

[18] [18]

Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems

Michael Denkowski and Alon Lavie. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. InProceedings of the Sixth Workshop on Statistical Machine Translation, WMT ’11, page 85–91, USA, 2011. Association for Computational Linguistics. ISBN 9781937284121

2011

[19] [19]

Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes.Medical image analysis, 67:101857, 2021

Rachel Lea Draelos, David Dov, Maciej A Mazurowski, Joseph Y Lo, Ricardo Henao, Geoffrey D Rubin, and Lawrence Carin. Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes.Medical image analysis, 67:101857, 2021

2021

[20] [20]

Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning.Neural networks, 107:3–11, 2018

2018

[21] [21]

CRG score: A distribution-aware clinical metric for radiology report generation

Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Bernhard Kainz, and Bjoern Menze. CRG score: A distribution-aware clinical metric for radiology report generation. InMedical Imaging with Deep Learning - Short Papers, 2025

2025

[22] [22]

Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Omer Faruk Durugol, Benjamin Hou, Suprosanna Shit, et al. Generalist foundation models from a multimodal dataset for 3d computed tomography.Nature Biomedical Engineering, pages 1–19, 2026

2026

[23] [23]

Roth, and Daguang Xu

Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R. Roth, and Daguang Xu. UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 574–584, 2022

2022

[24] [24]

Gloria: A multimodal global- local representation learning framework for label-efficient medical image recognition

Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global- local representation learning framework for label-efficient medical image recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3942–3951, 2021

2021

[25] [25]

Mong, Safwan S

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y . Ng. Chexpert: A large chest radiogr...

work page doi:10.1609/aaai.v33i01.3301590 2019

[26] [26]

Nat Methods18(2), 203–211 (Feb 2021)

Fabian Isensee, Paul F. Jaeger, Simon A. A. Kohl, Jens Petersen, and Klaus H. Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation.Nature Methods, 18(2): 203–211, 2021. doi: 10.1038/s41592-020-01008-z

work page doi:10.1038/s41592-020-01008-z 2021

[27] [27]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 11

2019

[28] [28]

Video pretraining advances 3d deep learning on chest ct tasks

Alexander Ke, Shih-Cheng Huang, Chloe P O’Connell, Michal Klimont, Serena Yeung, and Pranav Rajpurkar. Video pretraining advances 3d deep learning on chest ct tasks. InMedical Imaging with Deep Learning, pages 758–774. PMLR, 2024

2024

[29] [29]

Fine-tuning can distort pretrained features and underperform out-of-distribution

Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. InInternational Conference on Learning Representations, 2022

2022

[30] [30]

Carzero: Cross-attention alignment for radiology zero-shot classification

Haoran Lai, Qingsong Yao, Zihang Jiang, Rongsheng Wang, Zhiyang He, Xiaodong Tao, and S Kevin Zhou. Carzero: Cross-attention alignment for radiology zero-shot classification. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11137–11146, 2024

2024

[31] [31]

UniCLIP: Unified framework for contrastive language-image pre-training

Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, and Junmo Kim. UniCLIP: Unified framework for contrastive language-image pre-training. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

2022

[32] [32]

A structure- aware relation network for thoracic diseases detection and segmentation.IEEE Transactions on Medical Imaging, 40(8):2042–2052, 2021

Jie Lian, Jingyu Liu, Shu Zhang, Kai Gao, Xiaoqing Liu, Dingwen Zhang, and Yizhou Yu. A structure- aware relation network for thoracic diseases detection and segmentation.IEEE Transactions on Medical Imaging, 40(8):2042–2052, 2021

2042

[33] [33]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

2004

[34] [34]

Mingquan Lin, Gregory Holste, Song Wang, Yiliang Zhou, Yishu Wei, Imon Banerjee, Pengyi Chen, Tianjie Dai, Yuexi Du, Nicha C. Dvornek, Yuyan Ge, Zuwei Guo, Shouhei Hanaoka, Dongkyun Kim, Pablo Messina, Yang Lu, Denis Parra, Donghyun Son, Álvaro Soto, Aisha Urooj, René Vidal, Yosuke Yamagishi, Pingkun Yan, Zefan Yang, Ruichi Zhang, Yang Zhou, Leo Anthony C...

work page doi:10.1016/j.media.2025.103739 2024

[35] [35]

Medical Image Analysis42, 60–88 (Dec 2017).https: //doi.org/10.1016/j.media.2017.07.005

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis.Medical Image Analysis, 42:60–88, 2017. ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2017.07.005

work page doi:10.1016/j.media.2017.07.005 2017

[36] [36]

Chestx-det10: Chest x-ray dataset on detection of thoracic abnormali- ties.arXiv preprint arXiv:2006.10550, 2020

Jingyu Liu, Jie Lian, and Yizhou Yu. Chestx-det10: Chest x-ray dataset on detection of thoracic abnormali- ties.arXiv preprint arXiv:2006.10550, 2020

arXiv 2006

[37] [37]

TIPS: Text-Image Pretraining with Spatial Awareness

Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, and André Araujo. TIPS: Text-Image Pretraining with Spatial Awareness. InICLR, 2025

2025

[38] [38]

From softmax to sparsemax: A sparse model of attention and multi-label classification

Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. InInternational conference on machine learning, pages 1614–1623. PMLR, 2016

2016

[39] [39]

V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning.arXiv preprint arXiv:2603.14482, 2026

Pith/arXiv arXiv 2026

[40] [40]

Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations.Scientific Data, 9(1):429, 2022

Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations.Scientific Data, 9(1):429, 2022

2022

[41] [41]

Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

Pith/arXiv arXiv 2023

[42] [42]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

2022

[43] [43]

Radzero3d: Bridging self-supervised video models and medical vision-language alignment for zero-shot chest ct interpretation

Jonggwon Park, Kyoyun Choi, Byungmu Yoon, Hong Geun Cho, and Bumcheol Hwang. Radzero3d: Bridging self-supervised video models and medical vision-language alignment for zero-shot chest ct interpretation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6742–6749, 2025

2025

[44] [44]

Radzero: Similarity-based cross- attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability

Jonggwon Park, Byungmu Yoon, Soobum Kim, and Kyoyun Choi. Radzero: Similarity-based cross- attention for explainable vision-language alignment in chest x-ray with zero-shot multi-task capability. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[45] [45]

Exploring scalable medical image encoders beyond text supervision.Nature Machine Intelligence, 7(1):119–130, 2025

Fernando Perez-Garcia, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maxim- ilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Exploring scalable medical image encoders beyond text supervision.Nature Machine Intelligence, 7(1):119–130, 2025

2025

[46] [46]

Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework

Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, and Johan W Verjans. Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11492–11...

2024

[47] [47]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[48] [48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine...

2021

[49] [49]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, ...

work page doi:10.18653/v1/d19-1410 2019

[50] [50]

V oxtell: Free-text promptable universal 3d medical image segmentation, 2025

Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, and Klaus Maier- Hein. V oxtell: Free-text promptable universal 3d medical image segmentation, 2025

2025

[51] [51]

Drafting the future: the dawn of ai report generation in radiology.Radiology, 316(1):e243378, 2025

Jarrel CY Seah, Jennifer SN Tang, and Aengus Tran. Drafting the future: the dawn of ai report generation in radiology.Radiology, 316(1):e243378, 2025

2025

[52] [52]

Medgemma technical report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025

[53] [53]

Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia.Radiology

George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia.Radiology. Artificial intelligence, 1(1), 2019

2019

[54] [54]

Large-scale and fine-grained vision-language pre-training for en- hanced CT image understanding

Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, and Ling Zhang. Large-scale and fine-grained vision-language pre-training for en- hanced CT image understanding. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[55] [55]

Dinov3.arXiv preprint arXiv:2508.10104, 2025

Oriane Siméoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

Pith/arXiv arXiv 2025

[56] [56]

Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew P Lungren. Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using bert. InEMNLP 2020-2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pages 1500–1519, 2020. 13

2020

[57] [57]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025

[58] [58]

Comprehensive language-image pre-training for 3d medical image understanding.arXiv preprint arXiv:2510.15042, 2025

Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Noel CF Codella, Maria Teodora Wetscherek, et al. Comprehensive language-image pre-training for 3d medical image understanding.arXiv preprint arXiv:2510.15042, 2025

arXiv 2025

[59] [59]

Multi-granularity cross- modal alignment for generalized medical visual representation learning

Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, and Lequan Yu. Multi-granularity cross- modal alignment for generalized medical visual representation learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors,Advances in Neural Information Processing Systems, 2022

2022

[60] [60]

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471, 2017. doi: 10.1109/CVPR.2017.369

work page doi:10.1109/cvpr.2017.369 2017

[61] [61]

Medclip: Contrastive learning from unpaired medical images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. InProceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, volume 2022, page 3876, 2022

2022

[62] [62]

Warfield, K.H

S.K. Warfield, K.H. Zou, and W.M. Wells. Simultaneous truth and performance level estimation (staple): an algorithm for the validation of image segmentation.IEEE Transactions on Medical Imaging, 23(7): 903–921, 2004. doi: 10.1109/TMI.2004.828354

work page doi:10.1109/tmi.2004.828354 2004

[63] [63]

Totalsegmentator: robust segmentation of 104 anatomic structures in ct images.Radiology: Artificial Intelligence, 5(5):e230024, 2023

Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Sauter, Tobias Heye, Daniel T Boll, Joshy Cyriac, Shan Yang, et al. Totalsegmentator: robust segmentation of 104 anatomic structures in ct images.Radiology: Artificial Intelligence, 5(5):e230024, 2023

2023

[64] [64]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. InProceedings of the IEEE/CVF international conference on computer vision, pages 21372–21383, 2023

2023

[65] [65]

Unified perceptual parsing for scene understanding

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. InProceedings of the European conference on computer vision (ECCV), pages 418–434, 2018

2018

[66] [66]

A generalizable 3d framework and model for self-supervised learning in medical imaging

Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G Krishnan, Anne L Martel, and Maged Goubran. A generalizable 3d framework and model for self-supervised learning in medical imaging. npj Digital Medicine, 8(1):639, 2025

2025

[67] [67]

Advancing multimodal medical capabilities of gemini

Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162, 2024

arXiv 2024

[68] [68]

Visual representation alignment for multimodal large language models.arXiv preprint arXiv:2509.07979, 2025

Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual representation alignment for multimodal large language models.arXiv preprint arXiv:2509.07979, 2025

arXiv 2025

[69] [69]

Infusing fine-grained visual knowledge to vision-language models

Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, and Ondrej Chum. Infusing fine-grained visual knowledge to vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4226–4235, 2025

2025

[70] [70]

Chexworld: Exploring image world modeling for radiograph representation learning

Yang Yue, Yulin Wang, Chenxin Tao, Pan Liu, Shiji Song, and Gao Huang. Chexworld: Exploring image world modeling for radiograph representation learning. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 20778–20788, 2025

2025

[71] [71]

Siim-acr pneumothorax segmentation, 2019

Anna Zawacki, Carol Wu, George Shih, Julia Elliott, Mikhail Fomitchev, Mohannad Hussain, ParasLakhani, Phil Culliton, and Shunxing Bao. Siim-acr pneumothorax segmentation, 2019. Kaggle

2019

[72] [72]

Lit: Zero-shot transfer with locked-image text tuning

Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18123–18133, 2022. 14

2022

[73] [73]

Top-down neural attention by excitation backprop.International Journal of Computer Vision, 126(10):1084–1102, 2018

Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop.International Journal of Computer Vision, 126(10):1084–1102, 2018

2018

[74] [74]

Knowledge-enhanced visual- language pre-training on chest radiology images.Nature Communications, 14(1):4542, 2023

Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Weidi Xie, and Yanfeng Wang. Knowledge-enhanced visual- language pre-training on chest radiology images.Nature Communications, 14(1):4542, 2023

2023

[75] [75]

Contrastive learning of medical visual representations from paired images and text

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. InMachine learning for healthcare conference, pages 2–25. PMLR, 2022

2022

[76] [76]

Large-vocabulary segmentation for medical images with text prompts.NPJ Digital Medicine, 8(1): 566, 2025

Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, and Weidi Xie. Large-vocabulary segmentation for medical images with text prompts.NPJ Digital Medicine, 8(1): 566, 2025

2025

[77] [77]

Advancing radiograph representation learn- ing with masked record modeling

Hong-Yu Zhou, Chenyu Lian, Liansheng Wang, and Yizhou Yu. Advancing radiograph representation learn- ing with masked record modeling. InThe Eleventh International Conference on Learning Representations, 2023

2023

[78] [78]

There is

Yang Zhou, Tan Li Hui Faith, Yanyu Xu, Sicong Leng, Xinxing Xu, Yong Liu, and Rick Siow Mong Goh. Benchx: A unified benchmark framework for medical vision-language pretraining on chest x-rays. In Advances in Neural Information Processing Systems, volume 37, pages 6625–6647, 2024. 15 A Additional Visualization Results Gated Similarity Maps (GSM).We provide...

2024