arxiv: 2606.08420 · v2 · pith:RJ7GDQZZnew · submitted 2026-06-07 · 💻 cs.CV

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

Pith reviewed 2026-06-27 18:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · chest radiographs · anatomical segmentation · autoregressive prediction · synthetic data · domain shift · medical imaging · sample efficiency

The pith

Training vision-language models to autoregressively output anatomical segmentation masks from chest X-rays produces representations competitive with dedicated convolutional networks and more robust to real-world data shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to embed fine-grained anatomical structure into vision-language models without adding special decoder heads. Instead of global image-text alignment, the model learns to generate segmentation masks token by token using next-token prediction. Supervision comes from synthetic radiographs created by projecting CT volumes and their segmentations into 2D. This yields performance on par with U-Nets on matching data and better geometric consistency when tested on actual chest X-rays. Models pretrained this way also adapt to new localization tasks with fewer labeled examples.

Core claim

The approach fine-tunes a pretrained vision-language model to generate anatomical segmentation masks via next-token prediction on synthetic chest radiographs derived from CT, which embeds explicit anatomical knowledge directly into the generative objective. It achieves performance comparable to specialized convolutional models in-distribution, improved geometric robustness under domain shift to real CXR data, and better sample efficiency on novel tasks.

What carries the argument

Autoregressive token-space supervision: the model generates anatomical segmentation masks through next-token prediction instead of using task-specific decoder heads.
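
To make the idea of token-space supervision concrete, here is a minimal sketch of how a bounding box could be turned into text tokens that a language model predicts one at a time. It assumes the location-token convention documented for PaliGemma (coordinates binned into 1024 levels and written as `<locNNNN>` in y_min, x_min, y_max, x_max order). Whether CheXanatomy uses exactly this vocabulary is an assumption, and the function names `box_to_tokens` and `tokens_to_box` are hypothetical.

```python
import re

NUM_BINS = 1024  # assumed number of coordinate bins (PaliGemma-style convention)


def box_to_tokens(box, width, height, label):
    """Encode a pixel box (x_min, y_min, x_max, y_max) as location tokens plus a label."""
    x_min, y_min, x_max, y_max = box

    def to_bin(value, size):
        # Normalize to [0, 1], scale to the bin range, and clamp to valid indices.
        idx = int(value / size * NUM_BINS)
        return min(max(idx, 0), NUM_BINS - 1)

    # Assumed token order: y_min, x_min, y_max, x_max.
    bins = [to_bin(y_min, height), to_bin(x_min, width),
            to_bin(y_max, height), to_bin(x_max, width)]
    return "".join(f"<loc{b:04d}>" for b in bins) + f" {label}"


def tokens_to_box(text, width, height):
    """Decode the first four location tokens back to a pixel box (x_min, y_min, x_max, y_max)."""
    bins = [int(m) for m in re.findall(r"<loc(\d{4})>", text)][:4]
    y_min, x_min, y_max, x_max = bins
    return (x_min / NUM_BINS * width, y_min / NUM_BINS * height,
            x_max / NUM_BINS * width, y_max / NUM_BINS * height)


target = box_to_tokens((128, 64, 384, 448), 512, 512, "heart")
print(target)                           # <loc0128><loc0256><loc0896><loc0768> heart
print(tokens_to_box(target, 512, 512))  # (128.0, 64.0, 384.0, 448.0)
```

The resulting string is an ordinary training target for next-token prediction, so no extra decoder head is required. Decoding returns the lower edge of each bin, so the round trip is exact only when coordinates fall on bin boundaries, as in this example.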

Load-bearing premise

Forward-projected CT segmentation labels create supervision signals that are anatomically accurate and close enough in distribution to real chest radiographs for the representations to transfer effectively.
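
The premise above can be illustrated with a deliberately simplified sketch of forward projection. It assumes a parallel-beam (orthographic) geometry in which the synthetic image is the sum of voxel values along each ray and a 2D mask pixel is set wherever any voxel along the ray carries the target label. This is not the paper's renderer: real digitally reconstructed radiograph pipelines typically model cone-beam geometry and attenuation physics, and the function names and toy volume here are hypothetical.

```python
def project_intensity(volume):
    """Sum voxel values along the first (depth) axis: a parallel-beam line integral."""
    rows, cols = len(volume[0]), len(volume[0][0])
    return [[sum(depth_slice[r][c] for depth_slice in volume)
             for c in range(cols)] for r in range(rows)]


def project_label(labels, target):
    """Return a 2D mask that is 1 wherever any voxel along the ray equals `target`."""
    rows, cols = len(labels[0]), len(labels[0][0])
    return [[int(any(depth_slice[r][c] == target for depth_slice in labels))
             for c in range(cols)] for r in range(rows)]


# Toy volume indexed as [depth][row][col]: two depth slices of size 2 x 3.
volume = [[[1, 0, 2], [0, 0, 1]],
          [[3, 1, 0], [0, 2, 1]]]
labels = [[[1, 0, 0], [0, 0, 2]],
          [[1, 1, 0], [0, 0, 0]]]

print(project_intensity(volume))  # [[4, 1, 2], [0, 2, 2]]
print(project_label(labels, 1))   # [[1, 1, 0], [0, 0, 0]]
print(project_label(labels, 2))   # [[0, 0, 0], [0, 0, 1]]
```

Because the image and the mask are produced by projecting the same volume along the same rays, they are aligned by construction. Whether such masks are distributionally close to real radiographs is exactly the open question the premise raises.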

What would settle it

A direct comparison showing that a model trained only on the synthetic data underperforms a standard U-Net by a large margin on segmentation of real chest radiographs would falsify the transfer claim.
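
Such a comparison would typically be scored with overlap metrics like Dice and IoU, both of which appear in the paper's Figure 2. The following is a minimal sketch of the two metrics for binary masks, using their standard definitions. The function name `dice_and_iou` and the convention for two empty masks are illustrative assumptions, not taken from the paper.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two binary masks given as equal-sized nested lists."""
    inter = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            truth_sum += 1 if t else 0
    union = pred_sum + truth_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + truth_sum)
    iou = inter / union
    return dice, iou


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
dice, iou = dice_and_iou(pred, truth)
print(round(dice, 4), iou)  # 0.6667 0.5
```

Dice and IoU measure overlap only. Figure 2 also reports Hausdorff distance and a Fourier descriptor distance, which capture the boundary and shape errors that the geometric-robustness claim depends on.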

Figures

Figures reproduced from arXiv: 2606.08420 by Christian Bluethgen, Curtis Langlotz, Sergios Gatidis.

Figure 1: Overview of CheXanatomy. CT volumes with anatomic labels are projected into synthetic CXRs, and bounding boxes and segmentation masks are encoded into structured tokens. A pretrained vision–language model (based on the PaliGemma architecture) is fine-tuned to autoregressively generate anatomical bounding boxes and segmentations via next-token prediction. The trained model approaches convolutional baseline…
Figure 2: Segmentation performance on synthetic CT-RATE (top) and combined real-world radiographs from CheXmask and VinDr-RibCXR (bottom). Boxplots show Dice, IoU, Hausdorff distance, and Fourier descriptor distance across model scales and input resolutions. On synthetic data, the best PaliGemma models approach U-Net in overlap metrics and achieve lower boundary and shape errors. Under domain shift, PaliGemma mainta…
Figure 3: Representative examples for out-of-distribution segmentation performance of anatomy-trained VLMs and a specialized U-Net compared to ground truth segmentations. Only VLMs with frozen vision encoders during fine-tuning are shown. Top row: CheXmask example (lungs and heart). Bottom row: VinDr-RibCXR example (bilateral rib structures). Compared to U-Net, anatomy-trained VLMs demonstrate more stable geometric…
Figure 4: Performance scaling on the RadVLM grounding benchmark comparing anatomy-pretrained and baseline models across training data fractions. Top: Success rate and mean IoU as a function of available training data (0–100%). Anatomy-pretrained models show clear advantages in the low-data regime (0–3%), with performance converging as more task-specific data becomes available. Bottom: Representative grounding exam…
Abstract

Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CheXanatomy, a framework that integrates explicit anatomical knowledge into pretrained vision-language models for chest radiographs via autoregressive next-token prediction of segmentation masks. Supervision is obtained by synthesizing radiographs from CT volumes and forward-projecting CT segmentation labels into 2D masks. The model is evaluated against a U-Net baseline on both synthetic and real CXR data, with ablations on model scale, input resolution, and vision-encoder fine-tuning. The central claims are that autoregressive anatomical supervision yields performance comparable to specialized convolutional models in-distribution, improved geometric robustness under domain shift to real CXR, and better sample efficiency when adapting to novel localization tasks under limited supervision.

Significance. If the empirical claims hold after verification of the synthetic-to-real transfer, the work would advance anatomy-aware medical VLMs by demonstrating that generative supervision on projected anatomical labels can produce spatially grounded representations without task-specific decoder heads. This could improve robustness and few-shot efficiency for localization tasks in medical imaging while leveraging scalable synthetic supervision.

major comments (2)
  1. [Abstract and data-synthesis description] The central claims of comparable in-distribution performance and improved geometric robustness under domain shift rest on the unverified assumption that forward-projected CT segmentation labels produce supervision signals that are both anatomically accurate and distributionally close to real chest radiographs. No quantitative measures of domain gap (e.g., intensity histogram divergence, projection artifact analysis) or failure-mode examination are reported, which directly undermines the transfer and robustness assertions.
  2. [Evaluation section] The abstract asserts 'performance comparable to specialized convolutional models' and 'improved geometric robustness' yet supplies no quantitative metrics, error bars, Dice/IoU scores, or ablation tables, leaving the strength of the central empirical claims impossible to assess from the provided description.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., Dice score or robustness delta) to support the performance claims.
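
Major comment 1 asks for a quantitative measure of the synthetic-to-real gap, naming intensity histogram divergence as one option. One possible way to compute such a number is sketched below using the Jensen-Shannon divergence between normalized intensity histograms. The report does not name a specific divergence, so that choice, the bin count, the assumed [0, 1] intensity range, and the function names are all illustrative assumptions.

```python
import math


def histogram(values, bins=8, lo=0.0, hi=1.0):
    """Normalized histogram of a non-empty list of intensities assumed to lie in [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp edge values into range
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability mass contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


# Toy pixel intensities standing in for synthetic and real radiographs.
synthetic = [0.1, 0.1, 0.9, 0.9]
real = [0.1, 0.9, 0.9, 0.9]
gap = js_divergence(histogram(synthetic, bins=2), histogram(real, bins=2))
print(0.0 <= gap <= 1.0)  # True
```

With base-2 logarithms the value lies between 0 (identical histograms) and 1 (non-overlapping histograms), which makes it easy to compare across datasets. It captures only global intensity statistics and would not detect the projection artifacts the referee also mentions.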

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger validation of the synthetic data pipeline and clearer presentation of quantitative results. We respond to each major comment below.

point-by-point responses
  1. Referee: [Abstract and data-synthesis description] The central claims of comparable in-distribution performance and improved geometric robustness under domain shift rest on the unverified assumption that forward-projected CT segmentation labels produce supervision signals that are both anatomically accurate and distributionally close to real chest radiographs. No quantitative measures of domain gap (e.g., intensity histogram divergence, projection artifact analysis) or failure-mode examination are reported, which directly undermines the transfer and robustness assertions.

    Authors: We agree that explicit quantitative characterization of the synthetic-to-real domain gap would strengthen the manuscript. The current version relies on downstream performance on real CXR data as evidence of successful transfer and robustness, but does not report intensity histogram divergence, projection artifact statistics, or a dedicated failure-mode analysis of the forward-projection process. In revision we will add these analyses, including side-by-side intensity distribution comparisons and qualitative examples of projection artifacts. revision: yes

  2. Referee: [Evaluation section] The abstract asserts 'performance comparable to specialized convolutional models' and 'improved geometric robustness' yet supplies no quantitative metrics, error bars, Dice/IoU scores, or ablation tables, leaving the strength of the central empirical claims impossible to assess from the provided description.

    Authors: The full manuscript contains the requested quantitative results: Dice and IoU scores with standard-error bars across multiple runs, direct comparisons against the U-Net baseline on both synthetic and real data, and ablation tables for model scale, input resolution, and vision-encoder fine-tuning. These appear in the Evaluation section with accompanying figures and tables. The abstract follows the conventional practice of summarizing findings at a high level without numerical values. We can expand the abstract to include key metrics if the editor prefers. revision: partial

Circularity Check

0 steps flagged

No circularity; external CT supervision and standard eval keep claims independent

full rationale

The paper trains a VLM on forward-projected CT segmentations to generate 2D masks autoregressively, then evaluates transfer to real CXR and few-shot tasks. No equations, fitted parameters, or self-citations reduce the reported performance, robustness, or sample-efficiency gains to quantities defined inside the paper. Supervision originates from independent CT volumes; evaluation uses held-out synthetic and real data. This matches the default non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that CT-to-2D projections supply faithful anatomical labels for real CXR training and that next-token prediction on masks induces spatially grounded representations. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Forward projection of CT segmentation labels yields 2D masks that are anatomically consistent and sufficiently representative of real chest radiograph anatomy for model training and transfer.
    Invoked to justify scalable supervision and evaluation on real CXR data.

pith-pipeline@v0.9.1-grok · 5759 in / 1381 out tokens · 22148 ms · 2026-06-27T18:59:22.263451+00:00 · methodology

