CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

Christian Bluethgen; Curtis Langlotz; Sergios Gatidis

arxiv: 2606.08420 · v2 · pith:RJ7GDQZZnew · submitted 2026-06-07 · 💻 cs.CV

Pith reviewed 2026-06-27 18:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · chest radiographs · anatomical segmentation · autoregressive prediction · synthetic data · domain shift · medical imaging · sample efficiency

The pith

Training vision-language models to autoregressively output anatomical segmentation masks from chest X-rays produces representations competitive with dedicated convolutional networks and more robust to real-world data shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to embed fine-grained anatomical structure into vision-language models without adding special decoder heads. Instead of global image-text alignment, the model learns to generate segmentation masks token by token using next-token prediction. Supervision comes from synthetic radiographs created by projecting CT volumes and their segmentations into 2D. This yields performance on par with U-Nets on matching data and better geometric consistency when tested on actual chest X-rays. Models pretrained this way also adapt to new localization tasks with fewer labeled examples.

Core claim

By training a pretrained vision-language model to generate anatomical segmentation masks via next-token prediction on synthetic chest radiographs derived from CT, the approach embeds explicit anatomical knowledge directly into the generative objective, achieving performance comparable to specialized convolutional models in-distribution while demonstrating improved geometric robustness under domain shift to real CXR data and better sample efficiency for novel tasks.

What carries the argument

autoregressive token-space supervision, where the model generates anatomical segmentation masks through next-token prediction instead of using task-specific decoder heads
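
To make token-space supervision concrete, the following is a minimal sketch of how a bounding box might be serialized into tokens that a language model can predict one at a time. The `<locNNNN>` token format, the 1024-bin quantization, the y-before-x ordering, and all function names are illustrative assumptions in the spirit of Pix2seq-style coordinate quantization; the token scheme actually used in the paper is not described in this summary.

```python
from typing import List, Tuple

NUM_BINS = 1024  # assumed number of coordinate bins (hypothetical choice)

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def _quantize(value: float, size: float, num_bins: int) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_tokens(box: Box, width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Serialize a box as four location tokens in (y_min, x_min, y_max, x_max) order."""
    x_min, y_min, x_max, y_max = box
    bins = [
        _quantize(y_min, height, num_bins),
        _quantize(x_min, width, num_bins),
        _quantize(y_max, height, num_bins),
        _quantize(x_max, width, num_bins),
    ]
    return [f"<loc{b:04d}>" for b in bins]


def tokens_to_box(tokens: List[str], width: int, height: int,
                  num_bins: int = NUM_BINS) -> Box:
    """Invert box_to_tokens, returning bin centers in pixel units."""
    y_min, x_min, y_max, x_max = (int(t[4:-1]) for t in tokens)

    def center(bin_index: int, size: float) -> float:
        return (bin_index + 0.5) / num_bins * size

    return (center(x_min, width), center(y_min, height),
            center(x_max, width), center(y_max, height))


def build_target(label: str, box: Box, width: int, height: int) -> str:
    """Target string for next-token training: location tokens, then the label."""
    return "".join(box_to_tokens(box, width, height)) + " " + label
```

With this scheme, localization becomes ordinary sequence prediction: the training loss is the usual cross-entropy over the target string, and no extra decoder head is needed. Quantization error is bounded by half a bin width.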

Load-bearing premise

Forward-projected CT segmentation labels create supervision signals that are anatomically accurate and close enough in distribution to real chest radiographs for the representations to transfer effectively.
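
The premise can be illustrated with a simplified forward projection. The sketch below assumes a parallel-beam geometry, a volume indexed as `volume[z][y][x]` with `y` as the beam direction, and hypothetical function names; a real pipeline would typically use a cone-beam digitally reconstructed radiograph tool, and the paper's actual procedure is not detailed here. Plain Python lists are used only to keep the example self-contained.

```python
import math
from typing import List, Sequence

Volume = Sequence[Sequence[Sequence[float]]]  # indexed as volume[z][y][x]


def project_attenuation(mu: Volume, voxel_size: float = 1.0) -> List[List[float]]:
    """Sum attenuation coefficients along y to get one line integral per (z, x) ray."""
    depth, rows, cols = len(mu), len(mu[0]), len(mu[0][0])
    return [
        [voxel_size * sum(mu[z][y][x] for y in range(rows)) for x in range(cols)]
        for z in range(depth)
    ]


def to_transmission(line_integrals: List[List[float]]) -> List[List[float]]:
    """Beer-Lambert law: transmitted fraction is exp(-line integral)."""
    return [[math.exp(-value) for value in row] for row in line_integrals]


def project_mask(mask: Volume) -> List[List[int]]:
    """A 2D pixel is foreground if any voxel along its ray belongs to the structure."""
    depth, rows, cols = len(mask), len(mask[0]), len(mask[0][0])
    return [
        [int(any(mask[z][y][x] for y in range(rows))) for x in range(cols)]
        for z in range(depth)
    ]
```

Because the image and the mask are projected along the same rays, they are aligned by construction. Whether the resulting intensities resemble real radiographs is a separate question, and it is exactly the one this premise depends on.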

What would settle it

A direct comparison showing that a model trained only on the synthetic data underperforms a standard U-Net by a large margin on segmentation of real chest radiographs would falsify the transfer claim.

Figures

Figures reproduced from arXiv: 2606.08420 by Christian Bluethgen, Curtis Langlotz, Sergios Gatidis.

**Figure 1.** Overview of CheXanatomy. CT volumes with anatomic labels are projected into synthetic CXRs, and bounding boxes and segmentation masks are encoded into structured tokens. A pretrained vision–language model (based on the PaliGemma architecture) is fine-tuned to autoregressively generate anatomical bounding boxes and segmentations via next-token prediction. The trained model approaches convolutional baseline…

**Figure 2.** Segmentation performance on synthetic CT-RATE (top) and combined real-world radiographs from CheXmask and VinDr-RibCXR (bottom). Boxplots show Dice, IoU, Hausdorff distance, and Fourier descriptor distance across model scales and input resolutions. On synthetic data, the best PaliGemma models approach U-Net in overlap metrics and achieve lower boundary and shape errors. Under domain shift, PaliGemma mainta…

**Figure 3.** Representative examples for out-of-distribution segmentation performance of anatomy-trained VLMs and a specialized U-Net compared to ground truth segmentations. Only VLMs with frozen vision encoders during fine-tuning are shown. Top row: CheXmask example (lungs and heart). Bottom row: VinDr-RibCXR example (bilateral rib structures). Compared to U-Net, anatomy-trained VLMs demonstrate more stable geometric…

**Figure 4.** Performance scaling on the RadVLM grounding benchmark comparing anatomy-pretrained and baseline models across training data fractions. Top: Success rate and mean IoU as a function of available training data (0–100%). Anatomy-pretrained models show clear advantages in the low-data regime (0–3%), with performance converging as more task-specific data becomes available. Bottom: Representative grounding exam…
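
For readers less familiar with the metrics named in the Figure 2 caption, here is a minimal sketch of Dice, IoU, and Hausdorff distance on small binary masks. The function names are hypothetical, the Hausdorff variant is computed over all foreground pixels (evaluations often use boundary points or a 95th-percentile variant instead), and the value returned for two empty masks is a convention chosen here, not something taken from the paper.

```python
import math
from typing import Sequence, Set, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask, mask[row][col]


def _foreground(mask: Mask) -> Set[Tuple[int, int]]:
    """Coordinates of all nonzero pixels."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred: Mask, truth: Mask) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    a, b = _foreground(pred), _foreground(truth)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def iou(pred: Mask, truth: Mask) -> float:
    """IoU = |A ∩ B| / |A ∪ B|; defined as 1.0 when both masks are empty."""
    a, b = _foreground(pred), _foreground(truth)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def hausdorff(pred: Mask, truth: Mask) -> float:
    """Symmetric Hausdorff distance in pixels; infinite if either mask is empty."""
    a, b = _foreground(pred), _foreground(truth)
    if not a or not b:
        return math.inf

    def directed(src, dst):
        # Largest distance from a source point to its nearest destination point.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))
```

Dice and IoU reward overlap, while Hausdorff distance penalizes the single worst boundary deviation. This is why a model can score similarly on overlap yet differ clearly on boundary and shape errors.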

Abstract

Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CheXanatomy adds anatomical masks to a VLM via autoregressive token prediction trained on CT-derived synthetic CXR, but the transfer claims lack any reported numbers or validation of the synthetic-to-real gap.

The paper's main move is to treat segmentation masks as sequences the VLM predicts token by token, using forward-projected CT labels on synthetic radiographs for supervision. This keeps the model inside its original generative setup instead of adding decoder heads. The ablations on scale, resolution, and whether to fine-tune the vision encoder are clear and useful to see.

The synthetic data route is a practical way to get dense labels at volume. If the projections hold up, it could help few-shot localization without task-specific training.

The load-bearing part is the domain gap. The abstract claims comparable in-distribution performance to a U-Net plus better geometric robustness and sample efficiency on real CXR, yet supplies no metrics, error bars, or checks on whether the synthetic images match real intensity profiles and artifacts. Without those, the robustness and efficiency gains stay unverified. The stress-test concern about projection artifacts and missing soft-tissue contrast looks like it still applies.

This is aimed at groups already working on medical VLMs who want to test spatial grounding inside the same model. It is concrete enough to deserve referee time so the actual numbers and data pipeline can be examined, even if revisions are needed on the evaluation.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CheXanatomy, a framework that integrates explicit anatomical knowledge into pretrained vision-language models for chest radiographs via autoregressive next-token prediction of segmentation masks. Supervision is obtained by synthesizing radiographs from CT volumes and forward-projecting CT segmentation labels into 2D masks. The model is evaluated against a U-Net baseline on both synthetic and real CXR data, with ablations on model scale, input resolution, and vision-encoder fine-tuning. The central claims are that autoregressive anatomical supervision yields performance comparable to specialized convolutional models in-distribution, improved geometric robustness under domain shift to real CXR, and better sample efficiency when adapting to novel localization tasks under limited supervision.

Significance. If the empirical claims hold after verification of the synthetic-to-real transfer, the work would advance anatomy-aware medical VLMs by demonstrating that generative supervision on projected anatomical labels can produce spatially grounded representations without task-specific decoder heads. This could improve robustness and few-shot efficiency for localization tasks in medical imaging while leveraging scalable synthetic supervision.

major comments (2)

[Abstract and data-synthesis description] The central claims of comparable in-distribution performance and improved geometric robustness under domain shift rest on the unverified assumption that forward-projected CT segmentation labels produce supervision signals that are both anatomically accurate and distributionally close to real chest radiographs. No quantitative measures of domain gap (e.g., intensity histogram divergence, projection artifact analysis) or failure-mode examination are reported, which directly undermines the transfer and robustness assertions.

[Evaluation section] The abstract asserts 'performance comparable to specialized convolutional models' and 'improved geometric robustness' yet supplies no quantitative metrics, error bars, Dice/IoU scores, or ablation tables, leaving the strength of the central empirical claims impossible to assess from the provided description.
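
As an illustration of the kind of domain-gap measure the first comment asks for, the sketch below compares two intensity distributions with the Jensen-Shannon divergence. This is one possible choice, not a measure reported by the authors; the function names, the bin count, and the assumption that intensities are normalized to [0, 1] are all hypothetical.

```python
import math
from typing import Iterable, List, Sequence


def histogram(values: Iterable[float], bins: int = 32,
              lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Normalized histogram of intensities; out-of-range values go to the edge bins."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a probability of zero contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Applied to pooled pixel intensities from synthetic and real radiographs, a value near 0 would suggest similar intensity profiles and a value near 1 a large gap. A histogram ignores spatial structure, so it cannot detect projection artifacts on its own.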

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., Dice score or robustness delta) to support the performance claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of the synthetic data pipeline and clearer presentation of quantitative results. We respond to each major comment below.

Referee: [Abstract and data-synthesis description] The central claims of comparable in-distribution performance and improved geometric robustness under domain shift rest on the unverified assumption that forward-projected CT segmentation labels produce supervision signals that are both anatomically accurate and distributionally close to real chest radiographs. No quantitative measures of domain gap (e.g., intensity histogram divergence, projection artifact analysis) or failure-mode examination are reported, which directly undermines the transfer and robustness assertions.

Authors: We agree that explicit quantitative characterization of the synthetic-to-real domain gap would strengthen the manuscript. The current version relies on downstream performance on real CXR data as evidence of successful transfer and robustness, but does not report intensity histogram divergence, projection artifact statistics, or a dedicated failure-mode analysis of the forward-projection process. In revision we will add these analyses, including side-by-side intensity distribution comparisons and qualitative examples of projection artifacts.

revision: yes

Referee: [Evaluation section] The abstract asserts 'performance comparable to specialized convolutional models' and 'improved geometric robustness' yet supplies no quantitative metrics, error bars, Dice/IoU scores, or ablation tables, leaving the strength of the central empirical claims impossible to assess from the provided description.

Authors: The full manuscript contains the requested quantitative results: Dice and IoU scores with standard-error bars across multiple runs, direct comparisons against the U-Net baseline on both synthetic and real data, and ablation tables for model scale, input resolution, and vision-encoder fine-tuning. These appear in the Evaluation section with accompanying figures and tables. The abstract follows the conventional practice of summarizing findings at a high level without numerical values. We can expand the abstract to include key metrics if the editor prefers.

revision: partial

Circularity Check

0 steps flagged

No circularity; external CT supervision and standard eval keep claims independent

The paper trains a VLM on forward-projected CT segmentations to generate 2D masks autoregressively, then evaluates transfer to real CXR and few-shot tasks. No equations, fitted parameters, or self-citations reduce the reported performance, robustness, or sample-efficiency gains to quantities defined inside the paper. Supervision originates from independent CT volumes; evaluation uses held-out synthetic and real data. This matches the default non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that CT-to-2D projections supply faithful anatomical labels for real CXR training and that next-token prediction on masks induces spatially grounded representations. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Forward projection of CT segmentation labels yields 2D masks that are anatomically consistent and sufficiently representative of real chest radiograph anatomy for model training and transfer. Invoked to justify scalable supervision and evaluation on real CXR data.

Reference graph

Works this paper leans on

27 extracted references · 23 canonical work pages · 6 internal anchors

[1] Beyer, L., Steiner, A., Susano Pinto, A., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., Unterthiner, T., Keysers, D., Koppula, S., Liu, F., Grycner, A., Gritsenko, A., Houlsby, N., Kumar, M., Rong, K., Eisenschlos, J., Kabra, R., Bauer, M., Bošnjak, M., Chen, X., Minderer, M., Voigtlaender, P., Bic... PaliGemma: A versatile 3B VLM for transfer. arXiv (2024)

[2] Chen, T., et al.: Pix2seq: A language modeling framework for object detection (2021), https://arxiv.org/abs/2109.10852

[3] Deperrois, N., Matsuo, H., Ruipérez-Campillo, S., Vandenhirtz, M., Laguna, S., Ryser, A., Fujimoto, K., Nishio, M., Sutter, T., Vogt, J.E., Kluckert, J., Frauenfelder, T., Blüthgen, C., Nooralahzadeh, F., Krauthammer, M.: RadVLM: A multitask conversational vision-language model for chest x-ray interpretation (2025), https://arxiv.org/abs/2502.03333

[4] Dong, Z., Wu, W., Hao, J., Chen, T., Weng, Z., Zhou, B.: AnyCXR: Human anatomy segmentation of chest x-ray at any acquisition position using multi-stage domain randomized synthetic data with imperfect annotations and conditional joint annotation regularization learning (2025), https://arxiv.org/abs/2512.17263

[5] Gaggion, N., Mosquera, C., Mansilla, L., Saidman, J.M., Aineseder, M., Milone, D.H., Ferrante, E.: CheXmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images. Scientific Data 11(1), 511 (2024). https://doi.org/10.1038/s41597-024-03358-1

[6] van Ginneken, B., Stegmann, M.B., Loog, M.: Segmentation of anatomical structures in chest radiographs using supervised methods: A comparative study on a public database. Medical Image Analysis 10(1), 19–40 (2006). https://doi.org/10.1016/j.media.2005.02.002

[7] Gopalakrishnan, V., Golland, P.: Fast auto-differentiable digitally reconstructed radiographs for solving inverse problems in intraoperative imaging. In: Clinical Image-Based Procedures: 11th Workshop, CLIP 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18, 2022, Proceedings. pp. 1–11. Springer-Verlag, Berlin, Heidelberg (2022). https:/... doi:10.1007/978-3-031-23179-7_1

[8] Hamamci, I.E., Er, S., Almas, F., Simsek, A.G., Esirgun, S.N., Dogan, I., Dasdelen, M.F., Durugol, O.F., Wittmann, B., Amiranashvili, T., Simsar, E., Simsar, M., Erdemir, E.B., Alanbay, A., Sekuboyina, A., Lafci, B., Bluethgen, C., Ozdemir, M.K., Menze, B.: Generalist foundation models from a multimodal dataset for 3D computed tomography. Nature Biome... doi:10.1038/s41551-025-01599-y

[9] Hou, B., Zhu, Q., Mathai, T.S., Jin, Q., Lu, Z., Summers, R.M.: Shadow and light: Digitally reconstructed radiographs for disease classification. arXiv preprint arXiv:2406.03688 (2024)

[10] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9

[11] Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods (2021). https://doi.org/10.1038/s41592-020-01008-z

[12] Kirillov, A., et al.: Segment anything (2023), https://arxiv.org/abs/2304.02643

[13] Koleilat, T., Asgariandehkordi, H., Rivaz, H., Xiao, Y.: MedCLIP-SAMv2: Towards universal text-driven medical image segmentation (2024), https://arxiv.org/abs/2409.19483

[14] Liu, J., Zhang, Y., Chen, J.N., Xiao, J., Lu, Y., Landman, B.A., Yuan, Y., Yuille, A., Tang, Y., Zhou, Z.: CLIP-driven universal model for organ segmentation and tumor detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

[15] Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts (2021), https://arxiv.org/abs/2112.10003

[16] Nguyen, H.C., Le, T.T., Pham, H.H., Nguyen, H.Q.: VinDr-RibCXR: A benchmark dataset for automatic segmentation and labeling of individual ribs on chest x-rays. arXiv preprint (2021), https://arxiv.org/abs/2107.01327, arXiv:2107.01327

[17] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision (2021), https://arxiv.org/abs/2103.00020

[18] Riviere, M., Pathak, S., Sessa, P.G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Le Lan, C., Jerome, S., Tsitsulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., et al.: Gemma 2: Improving open language... arXiv (2024)

[19] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer (2015), https://arxiv.org/abs/1505.04597

[20] Seibold, C., Jaus, A., Fink, M.A., Kim, M., Reiß, S., Herrmann, K., Kleesiek, J., Stiefelhagen, R.: Accurate fine-grained segmentation of human anatomy in radiographs via volumetric pseudo-labeling (2023), https://arxiv.org/abs/2306.03934

[21] Seibold, C.M., Reiß, S., Sarfraz, M.S., Fink, M.A., Mayer, V., Sellner, J., Kim, M.S., Maier-Hein, K.H., Kleesiek, J., Stiefelhagen, R.: Detailed annotations of chest x-rays via CT projection for report understanding. In: British Machine Vision Conference (BMVC) (2022), https://bmvc2022.mpi-inf.mpg.de/0058.pdf

[22] Steiner, A., Susano Pinto, A., Tschannen, M., Keysers, D., Wang, X., Bitton, Y., Gritsenko, A., Minderer, M., Sherbondy, A., Long, S., Qin, S., Ingle, R., Bugliarello, E., Kazemzadeh, S., Mesnard, T., Alabdulmohsin, I., Beyer, L., et al.: PaliGemma 2: A family of versatile vision–language models for transfer. arXiv preprint (2024), https://arxiv.org/abs/2412.03555, a...

[23] Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering 6, 1399–1406 (2022). https://doi.org/10.1038/s41551-022-00936-9

[24] Wang, X., et al.: SegGPT: Segmenting everything in context (2023), https://arxiv.org/abs/2304.03284

[25] Wasserthal, J., Meyer, M., Hanneman, K., et al.: TotalSegmentator: robust segmentation of 104 anatomical structures in CT images. Radiology: Artificial Intelligence 5(5), e230024 (2023). https://doi.org/10.1148/ryai.230024, https://arxiv.org/abs/2208.05868

[26] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11975–11986 (October 2023)

[27] Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text (2020), https://arxiv.org/abs/2010.00747
