LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

Myeongkyun Kang; Xiaoxiao Li; Yanting Yang

arxiv: 2603.19451 · v2 · submitted 2026-03-19 · 💻 cs.CV · cs.AI

LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray

Myeongkyun Kang , Yanting Yang , Xiaoxiao Li This is my paper

Pith reviewed 2026-05-15 08:05 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords chest x-rayfine-grained representationlocation-aware captioningphrase groundingimage retrievalvision-language modelcaptioning loss

0 comments

The pith

LoFi learns superior fine-grained chest X-ray representations by jointly optimizing location-aware captioning losses that supply region-level supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LoFi to address the difficulty of capturing spatially confined clinical findings in chest X-rays for retrieval and phrase grounding tasks. It combines sigmoid, captioning, and location-aware captioning losses optimized together with a lightweight large language model. The location-aware captioning loss supplies region-level supervision through grounding and dense captioning objectives without needing explicit region annotations. These representations feed into a fine-grained encoder that is integrated with retrieval-based in-context learning. Experiments demonstrate improved performance on retrieval and phrase grounding using the MIMIC-CXR and PadChest-GR datasets.

Core claim

LoFi jointly optimizes sigmoid, captioning, and location-aware captioning losses with a lightweight LLM so that the location-aware captioning loss supplies region-level supervision through grounding and dense captioning objectives, thereby producing fine-grained representations that improve chest X-ray retrieval and phrase grounding.

What carries the argument

The location-aware captioning loss, which supplies region-level supervision through grounding and dense captioning objectives.

If this is right

Superior retrieval performance on MIMIC-CXR and PadChest-GR datasets.
Enhanced phrase grounding across diverse settings through integration of a fine-grained encoder with retrieval-based in-context learning.
Fine-grained representation learning without the need for explicit region annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same loss structure may support spatially precise analysis in other medical imaging domains.
Lightweight LLM usage points to possible efficiency gains for clinical deployment.
Retrieval-based in-context learning may extend to few-shot diagnostic tasks in related imaging modalities.

Load-bearing premise

The location-aware captioning loss enables effective region-level supervision through grounding and dense captioning objectives without requiring explicit region annotations.

What would settle it

Direct experiments showing no performance gain over baselines on retrieval or phrase grounding tasks using MIMIC-CXR and PadChest-GR would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.19451 by Myeongkyun Kang, Xiaoxiao Li, Yanting Yang.

**Figure 2.** Figure 2: Qualitative results with ground truth (dashed) and predictions (solid) [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative results with ground truth (dashed) and predictions (solid) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: (a)-(d) Ablation results of our method using all losses (filled), without [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Fine-grained representation learning is crucial for retrieval and phrase grounding in chest X-rays, where clinically relevant findings are often spatially confined. However, the lack of region-level supervision in contrastive models and the limited ability of large vision language models to capture fine-grained representations in external validation lead to suboptimal performance on these tasks. To address these limitations, we propose Location-aware Fine-grained representation learning (LoFi), which jointly optimizes sigmoid, captioning, and location-aware captioning losses using a lightweight large language model. The location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives, thereby facilitating fine-grained representation learning. Building upon these representations, we integrate a fine-grained encoder into retrieval-based in-context learning to enhance chest X-ray grounding across diverse settings. Extensive experiments demonstrate that our method achieves superior retrieval and phrase grounding performance on MIMIC-CXR and PadChest-GR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LoFi adds a location-aware captioning loss to a lightweight LLM for finer CXR representations, but the abstract shows no numbers or mechanism details so the gains are unverified.

read the letter

The main point is that this paper combines sigmoid, captioning, and a location-aware captioning loss inside a lightweight LLM to push fine-grained representations for chest X-ray retrieval and phrase grounding. The location-aware term is meant to supply region-level supervision through grounding and dense captioning objectives, then the encoder gets plugged into retrieval-based in-context learning. That combination is the concrete new piece relative to prior contrastive and grounding work on MIMIC-CXR and PadChest-GR. It targets a real gap: standard models miss spatially confined findings and big VLMs often lose detail on external sets. The framing is straightforward and the downstream integration looks practical for clinical tools. The soft spot is the complete absence of any quantitative results, baselines, or ablation numbers in the abstract. Without those, it is impossible to tell whether the location-aware loss actually produces verifiable spatial signals or simply falls back to implicit cues from the LLM. The stress-test concern lands here: if no bounding-box pseudo-labels, attention maps, or coordinate regression are used, the claimed region-level benefit may not be real. The full paper needs to show exactly how the spatial information is generated and injected, plus the actual performance deltas and controls. This is for researchers building medical VLMs who care about grounding without heavy annotation. It deserves a serious referee to check the experiments and clarify the mechanism, even if the current write-up is thin on evidence.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes LoFi, a method for fine-grained representation learning in chest X-rays. It jointly optimizes sigmoid, captioning, and location-aware captioning losses in a lightweight LLM, where the location-aware loss is intended to enable region-level supervision via grounding and dense captioning objectives. A fine-grained encoder is then integrated into retrieval-based in-context learning. The central claim is that this yields superior retrieval and phrase grounding performance on the MIMIC-CXR and PadChest-GR datasets.

Significance. If the location-aware captioning loss can be shown to deliver verifiable region-level supervision without explicit annotations, the approach would offer a practical advance for fine-grained medical vision-language tasks, improving retrieval and grounding while remaining computationally lightweight. The emphasis on joint optimization of multiple losses and integration with in-context learning is a reasonable direction, but the significance is currently difficult to evaluate given the absence of implementation details on spatial signal generation.

major comments (1)

[§3] §3 (Method, location-aware captioning loss): the manuscript does not specify how spatial location signals are generated or injected into the loss (no bounding-box pseudo-labels, attention maps from a frozen detector, coordinate regression, or other mechanism is described). This is load-bearing for the central claim, as the asserted superiority on retrieval and phrase grounding is attributed to effective region-level supervision from this loss.

minor comments (2)

[Abstract] Abstract: the claim of 'superior retrieval and phrase grounding performance' is stated without any quantitative numbers, baselines, or ablation results, which reduces the informativeness of the summary.
[§3] The paper would benefit from a clear diagram or pseudocode showing the exact flow of the location-aware loss computation and how it interacts with the grounding/dense captioning objectives.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will revise the manuscript to improve clarity and reproducibility.

read point-by-point responses

Referee: [§3] §3 (Method, location-aware captioning loss): the manuscript does not specify how spatial location signals are generated or injected into the loss (no bounding-box pseudo-labels, attention maps from a frozen detector, coordinate regression, or other mechanism is described). This is load-bearing for the central claim, as the asserted superiority on retrieval and phrase grounding is attributed to effective region-level supervision from this loss.

Authors: We agree that the current version of the manuscript does not provide sufficient detail on the generation and injection of spatial location signals into the location-aware captioning loss. This omission makes it difficult to fully evaluate the region-level supervision mechanism. In the revised manuscript, we will expand §3 to explicitly describe how spatial signals are produced (via pseudo-labels derived from the grounding and dense captioning objectives) and how they are incorporated into the loss computation. We believe this clarification will directly address the concern and strengthen support for the central claims regarding retrieval and phrase grounding performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a training recipe without self-referential derivations

full rationale

The paper describes LoFi as jointly optimizing sigmoid, captioning, and location-aware captioning losses in a lightweight LLM, with the location-aware loss claimed to enable region-level supervision via grounding and dense captioning. No equations, derivations, or fitted parameters are presented that reduce any prediction or result to the inputs by construction. Performance claims rest on experimental results on MIMIC-CXR and PadChest-GR rather than any closed-form equivalence or self-citation chain. The approach is self-contained as an empirical training procedure without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, axioms, or invented entities beyond standard assumptions of contrastive and captioning training.

pith-pipeline@v0.9.0 · 5455 in / 983 out tokens · 32458 ms · 2026-05-15T08:05:22.115780+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

jointly optimizes sigmoid, captioning, and location-aware captioning losses... Ls = -E log σ(δ(I,T) EI(I)⊤ ET(T)), Lc = -E log pLLM(T|EI(I),Pc), Lg = -E log pLLM(b|EI(I),Pg,s)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

location-aware captioning loss enables region-level supervision through grounding and dense captioning objectives

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

[1]

Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024

Bannur, S., Bouzid, K., Castro, D.C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., et al.: Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449 (2024)

work page arXiv 2024
[2]

NEJM AI2(7), AIdbp2401120 (2025)

de Castro, D.C., Bustos, A., Bannur, S., Hyland, S.L., Bouzid, K., Wetscherek, M.T., Sánchez-Valverde, M.D., Jaques-Pérez, L., Pérez-Rodríguez, L., Takeda, K., et al.: Padchest-gr: A bilingual chest x-ray dataset for grounded radiology report generation. NEJM AI2(7), AIdbp2401120 (2025)

work page 2025
[3]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Chen, Z., Zhou, Y., Tran, A., Zhao, J., Wan, L., Ooi, G.S.K., Cheng, L.T.E., Thng, C.H., Xu, X., Liu, Y., et al.: Medical phrase grounding with region-phrase context contrastive alignment. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 371–381. Springer (2023)

work page 2023
[4]

In: AAAI 2024 Spring Symposium on Clinical Foundation Models (2024)

Chen, Z., Varma, M., Delbrouck, J.B., Paschali, M., Blankemeier, L., Van Veen, D., Valanarasu, J.M.J., Youssef, A., Cohen, J.P., Reis, E.P., et al.: Chexagent: Towards a foundation model for chest x-ray interpretation. In: AAAI 2024 Spring Symposium on Clinical Foundation Models (2024)

work page 2024
[5]

Available: [https://arxiv.org/abs/2502.03333](https://arxiv.org/abs/2502.03333)

Deperrois, N., Matsuo, H., Ruipérez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., Fujimoto, K., Nishio, M., Sutter, T.M., Vogt, J.E., et al.: Radvlm: A multitask conversational vision-language model for radiology. arXiv preprint arXiv:2502.03333 (2025)

work page arXiv 2025
[6]

Scientific Data11(1), 511 (2024)

Gaggion, N., Mosquera, C., Mansilla, L., Saidman, J.M., Aineseder, M., Milone, D.H., Ferrante, E.: Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images. Scientific Data11(1), 511 (2024)

work page 2024
[7]

Haghighi, F., Taher, M.R.H., Gotway, M.B., Liang, J.: Self-supervised learning for medical image analysis: Discriminative, restorative, or adversarial? Medical Image Analysis94, 103086 (2024)

work page 2024
[8]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

work page 2022
[9]

arXiv preprint arXiv:2510.12798 (2025)

Jiang, Q., Huo, J., Chen, X., Xiong, Y., Zeng, Z., Chen, Y., Ren, T., Yu, J., Zhang, L.: Detect anything via next point prediction. arXiv preprint arXiv:2510.12798 (2025) 10 M. Kang et al

work page arXiv 2025
[10]

Scientific data6(1), 317 (2019)

Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data6(1), 317 (2019)

work page 2019
[11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lai, H., Yao, Q., Jiang, Z., Wang, R., He, Z., Tao, X., Zhou, S.K.: Carzero: Cross- attention alignment for radiology zero-shot classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11137– 11146 (2024)

work page 2024
[12]

arXiv preprint arXiv:2508.04572 (2025)

Li, J., Liu, C., Bai, W., Liu, M., Arcucci, R., Bercea, C.I., Schnabel, J.A.: Knowl- edge to sight: Reasoning over visual attributes via knowledge decomposition for abnormality grounding. arXiv preprint arXiv:2508.04572 (2025)

work page arXiv 2025
[13]

In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision

Liu,B.,Zou,K.,Zhan,L.M.,Lu,Z.,Dong,X.,Chen,Y.,Xie,C.,Cao,J.,Wu,X.M., Fu, H.: Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision. pp. 21310–21320 (2025)

work page 2025
[14]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Lozano, A., Sun, M.W., Burgess, J., Chen, L., Nirschl, J.J., Gu, J., Lopez, I., Aklilu, J., Rau, A., Katzer, A.W., et al.: Biomedica: An open biomedical image- caption archive, dataset, and vision-language models derived from scientific litera- ture. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19724–19735 (2025)

work page 2025
[15]

In: Medical Imaging with Deep Learning

Meissen, F., Müller, P., Kaissis, G., Rueckert, D.: Robust detection outcome: A metric for pathology detection in medical images. In: Medical Imaging with Deep Learning. pp. 568–585. PMLR (2024)

work page 2024
[16]

In: European Conference on Computer Vision

Müller, P., Kaissis, G., Rueckert, D.: Chex: Interactive localization and region description in chest x-rays. In: European Conference on Computer Vision. pp. 92–

work page
[17]

Phys- ioNet (Jul 2025)

Müller, P., Jungmann, F., Kaissis, G., Rueckert, D.: MIMIC-Ext-CXR-QBA: A Structured, Tagged, and Localized Visual Question Answering Dataset with Question-Box-Answer Triplets and Scene Graphs for Chest X-ray Images. Phys- ioNet (Jul 2025)

work page 2025
[18]

MedGemma Technical Report

Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Image and Vision Computing107, 104117 (2021)

Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing107, 104117 (2021)

work page 2021
[20]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, lo- calization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Advances in Neural Infor- mation Processing Systems34, 200–212 (2021)

Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S., Vinyals, O., Hill, F.: Multi- modal few-shot learning with frozen language models. Advances in Neural Infor- mation Processing Systems34, 200–212 (2021)

work page 2021
[23]

Advances in Neural Information Processing Systems37, 116355–116387 (2024) LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray 11

Wan, B., Tschannen, M., Xian, Y., Pavetic, F., Alabdulmohsin, I.M., Wang, X., Susano Pinto, A., Steiner, A., Beyer, L., Zhai, X.: Locca: Visual pretraining with location-aware captioners. Advances in Neural Information Processing Systems37, 116355–116387 (2024) LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray 11

work page 2024
[24]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xiao, R., Kim, S., Georgescu, M.I., Akata, Z., Alaniz, S.: Flair: Vlm with fine- grained language-informed image representations. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24884–24894 (2025)

work page 2025
[25]

In: Proceedings of the AAAI conference on artificial intelligence

Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., Wang, L.: An empirical study of gpt-3 for few-shot knowledge-based vqa. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 3081–3089 (2022)

work page 2022
[26]

Nature Communications16(1), 3108 (2025)

Zambrano Chaves, J.M., Huang, S.C., Xu, Y., Xu, H., Usuyama, N., Zhang, S., Wang, F., Xie, Y., Khademi, M., Yang, Z., et al.: A clinically accessible small multimodal radiology model and evaluation metric for chest x-ray findings. Nature Communications16(1), 3108 (2025)

work page 2025
[27]

NEJM AI2(1), AIoa2400640 (2025)

Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI2(1), AIoa2400640 (2025)

work page 2025
[28]

In: International Conference on Medical Image Computing and Computer- Assisted Intervention

Zhang, T., Zhao, Z., Wu, C., Zhou, X., Zhang, Y., Wang, Y., Xie, W.: Radir: A scalable framework for multi-grained medical image retrieval via radiology report mining. In: International Conference on Medical Image Computing and Computer- Assisted Intervention. pp. 508–518. Springer (2025)

work page 2025

[1] [1]

Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024

Bannur, S., Bouzid, K., Castro, D.C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., et al.: Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449 (2024)

work page arXiv 2024

[2] [2]

NEJM AI2(7), AIdbp2401120 (2025)

de Castro, D.C., Bustos, A., Bannur, S., Hyland, S.L., Bouzid, K., Wetscherek, M.T., Sánchez-Valverde, M.D., Jaques-Pérez, L., Pérez-Rodríguez, L., Takeda, K., et al.: Padchest-gr: A bilingual chest x-ray dataset for grounded radiology report generation. NEJM AI2(7), AIdbp2401120 (2025)

work page 2025

[3] [3]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Chen, Z., Zhou, Y., Tran, A., Zhao, J., Wan, L., Ooi, G.S.K., Cheng, L.T.E., Thng, C.H., Xu, X., Liu, Y., et al.: Medical phrase grounding with region-phrase context contrastive alignment. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 371–381. Springer (2023)

work page 2023

[4] [4]

In: AAAI 2024 Spring Symposium on Clinical Foundation Models (2024)

Chen, Z., Varma, M., Delbrouck, J.B., Paschali, M., Blankemeier, L., Van Veen, D., Valanarasu, J.M.J., Youssef, A., Cohen, J.P., Reis, E.P., et al.: Chexagent: Towards a foundation model for chest x-ray interpretation. In: AAAI 2024 Spring Symposium on Clinical Foundation Models (2024)

work page 2024

[5] [5]

Available: [https://arxiv.org/abs/2502.03333](https://arxiv.org/abs/2502.03333)

Deperrois, N., Matsuo, H., Ruipérez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., Fujimoto, K., Nishio, M., Sutter, T.M., Vogt, J.E., et al.: Radvlm: A multitask conversational vision-language model for radiology. arXiv preprint arXiv:2502.03333 (2025)

work page arXiv 2025

[6] [6]

Scientific Data11(1), 511 (2024)

Gaggion, N., Mosquera, C., Mansilla, L., Saidman, J.M., Aineseder, M., Milone, D.H., Ferrante, E.: Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images. Scientific Data11(1), 511 (2024)

work page 2024

[7] [7]

Haghighi, F., Taher, M.R.H., Gotway, M.B., Liang, J.: Self-supervised learning for medical image analysis: Discriminative, restorative, or adversarial? Medical Image Analysis94, 103086 (2024)

work page 2024

[8] [8]

ICLR1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. ICLR1(2), 3 (2022)

work page 2022

[9] [9]

arXiv preprint arXiv:2510.12798 (2025)

Jiang, Q., Huo, J., Chen, X., Xiong, Y., Zeng, Z., Chen, Y., Ren, T., Yu, J., Zhang, L.: Detect anything via next point prediction. arXiv preprint arXiv:2510.12798 (2025) 10 M. Kang et al

work page arXiv 2025

[10] [10]

Scientific data6(1), 317 (2019)

Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data6(1), 317 (2019)

work page 2019

[11] [11]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lai, H., Yao, Q., Jiang, Z., Wang, R., He, Z., Tao, X., Zhou, S.K.: Carzero: Cross- attention alignment for radiology zero-shot classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11137– 11146 (2024)

work page 2024

[12] [12]

arXiv preprint arXiv:2508.04572 (2025)

Li, J., Liu, C., Bai, W., Liu, M., Arcucci, R., Bercea, C.I., Schnabel, J.A.: Knowl- edge to sight: Reasoning over visual attributes via knowledge decomposition for abnormality grounding. arXiv preprint arXiv:2508.04572 (2025)

work page arXiv 2025

[13] [13]

In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision

Liu,B.,Zou,K.,Zhan,L.M.,Lu,Z.,Dong,X.,Chen,Y.,Xie,C.,Cao,J.,Wu,X.M., Fu, H.: Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. In: Proceedings of the IEEE/CVF International Confer- ence on Computer Vision. pp. 21310–21320 (2025)

work page 2025

[14] [14]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Lozano, A., Sun, M.W., Burgess, J., Chen, L., Nirschl, J.J., Gu, J., Lopez, I., Aklilu, J., Rau, A., Katzer, A.W., et al.: Biomedica: An open biomedical image- caption archive, dataset, and vision-language models derived from scientific litera- ture. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19724–19735 (2025)

work page 2025

[15] [15]

In: Medical Imaging with Deep Learning

Meissen, F., Müller, P., Kaissis, G., Rueckert, D.: Robust detection outcome: A metric for pathology detection in medical images. In: Medical Imaging with Deep Learning. pp. 568–585. PMLR (2024)

work page 2024

[16] [16]

In: European Conference on Computer Vision

Müller, P., Kaissis, G., Rueckert, D.: Chex: Interactive localization and region description in chest x-rays. In: European Conference on Computer Vision. pp. 92–

work page

[17] [17]

Phys- ioNet (Jul 2025)

Müller, P., Jungmann, F., Kaissis, G., Rueckert, D.: MIMIC-Ext-CXR-QBA: A Structured, Tagged, and Localized Visual Question Answering Dataset with Question-Box-Answer Triplets and Scene Graphs for Chest X-ray Images. Phys- ioNet (Jul 2025)

work page 2025

[18] [18]

MedGemma Technical Report

Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Image and Vision Computing107, 104117 (2021)

Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing107, 104117 (2021)

work page 2021

[20] [20]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, lo- calization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Advances in Neural Infor- mation Processing Systems34, 200–212 (2021)

Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S., Vinyals, O., Hill, F.: Multi- modal few-shot learning with frozen language models. Advances in Neural Infor- mation Processing Systems34, 200–212 (2021)

work page 2021

[23] [23]

Advances in Neural Information Processing Systems37, 116355–116387 (2024) LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray 11

Wan, B., Tschannen, M., Xian, Y., Pavetic, F., Alabdulmohsin, I.M., Wang, X., Susano Pinto, A., Steiner, A., Beyer, L., Zhai, X.: Locca: Visual pretraining with location-aware captioners. Advances in Neural Information Processing Systems37, 116355–116387 (2024) LoFi: Location-Aware Fine-Grained Representation Learning for Chest X-ray 11

work page 2024

[24] [24]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Xiao, R., Kim, S., Georgescu, M.I., Akata, Z., Alaniz, S.: Flair: Vlm with fine- grained language-informed image representations. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24884–24894 (2025)

work page 2025

[25] [25]

In: Proceedings of the AAAI conference on artificial intelligence

Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., Wang, L.: An empirical study of gpt-3 for few-shot knowledge-based vqa. In: Proceedings of the AAAI conference on artificial intelligence. vol. 36, pp. 3081–3089 (2022)

work page 2022

[26] [26]

Nature Communications16(1), 3108 (2025)

Zambrano Chaves, J.M., Huang, S.C., Xu, Y., Xu, H., Usuyama, N., Zhang, S., Wang, F., Xie, Y., Khademi, M., Yang, Z., et al.: A clinically accessible small multimodal radiology model and evaluation metric for chest x-ray findings. Nature Communications16(1), 3108 (2025)

work page 2025

[27] [27]

NEJM AI2(1), AIoa2400640 (2025)

Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI2(1), AIoa2400640 (2025)

work page 2025

[28] [28]

In: International Conference on Medical Image Computing and Computer- Assisted Intervention

Zhang, T., Zhao, Z., Wu, C., Zhou, X., Zhang, Y., Wang, Y., Xie, W.: Radir: A scalable framework for multi-grained medical image retrieval via radiology report mining. In: International Conference on Medical Image Computing and Computer- Assisted Intervention. pp. 508–518. Springer (2025)

work page 2025