arxiv: 2605.01266 · v1 · submitted 2026-05-02 · 💻 cs.CV

Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation

Pith reviewed 2026-05-09 15:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords zero-shot segmentation · vision-language models · NSCLC tumor delineation · prompt alignment · anatomical location · clinical factor decomposition · Dice similarity coefficient

The pith

Anatomical location dominates prompt alignment in zero-shot VLMs for NSCLC tumor segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests which pieces of clinical information in a text prompt actually steer a zero-shot vision-language model when it draws tumor boundaries in lung CT scans. It decomposes prompts into separate clinical categories and perturbs each one while measuring how much the output mask changes. The results show that changing the described anatomical site ruins performance far more often than changing diagnosis, stage, or histology. This matters because it reveals that these models can reach accuracy levels close to trained networks by simply being told where to look, without any task-specific training.
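The decompose-and-perturb idea can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `ClinicalPrompt` class, its field names, the `render` wording, and the example values are all hypothetical, since the paper's actual prompt templates are not reproduced on this page.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ClinicalPrompt:
    # Hypothetical attribute names; the paper's exact templates are not shown here.
    diagnosis: str
    histology: str
    stage: str
    location: str

    def render(self) -> str:
        """Join the attributes into one free-text prompt (illustrative wording)."""
        return (
            f"{self.diagnosis}, {self.histology}, stage {self.stage}, "
            f"in the {self.location}"
        )


def perturb(prompt: ClinicalPrompt, attribute: str, new_value: str) -> ClinicalPrompt:
    """Return a copy of `prompt` with exactly one attribute replaced."""
    valid = {f.name for f in fields(prompt)}
    if attribute not in valid:
        raise ValueError(f"unknown attribute {attribute!r}; expected one of {sorted(valid)}")
    return replace(prompt, **{attribute: new_value})


# Made-up example case: change only the anatomical location.
base = ClinicalPrompt(
    diagnosis="non-small-cell lung cancer",
    histology="adenocarcinoma",
    stage="IIIA",
    location="right upper lobe",
)
swapped = perturb(base, "location", "left lower lobe")
print(base.render())
print(swapped.render())
```

Holding every other field fixed while one attribute changes is what lets a drop in the output mask be attributed to that attribute, subject to the entanglement caveats raised in the referee report.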

Core claim

VoxTell, operating fully zero-shot, reached a mean Dice score of 0.613 on an internal NSCLC dataset, statistically indistinguishable from nnUNet (0.690) and Ahmed et al. (0.675). Sub-prompt tests found that 63.4 percent of anatomical-location perturbations produced catastrophic drops, prompt specificity raised performance except when only diagnosis was supplied, irrelevant prompts produced zero output, and swapping prompts across patients gave matched Dice of 0.906 versus mismatched Dice of 0.406. Histology and stage changes had little effect, indicating the model conditions on spatial location more than on tumor identity.
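For readers unfamiliar with the metric, the Dice score of a predicted mask A against a reference mask B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened binary masks using plain Python lists. The function name and the convention of returning 1.0 when both masks are empty are assumptions made here, not choices stated in the paper.

```python
def dice(pred, truth):
    """Dice similarity coefficient for two equal-length binary sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    pred_count = sum(1 for p in pred if p)
    truth_count = sum(1 for t in truth if t)
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    denom = pred_count + truth_count
    if denom == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * overlap / denom


# Toy flattened masks, not data from the paper.
reference = [1, 1, 0, 0]
print(dice([1, 0, 0, 0], reference))  # partial overlap
print(dice([0, 0, 0, 0], reference))  # empty prediction against a non-empty reference
```

Under this definition, an empty prediction against a non-empty reference scores 0. That is why the zero-output cases in the mismatched prompt swaps pull the mismatched average down so strongly.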

What carries the argument

sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls, plus attribute-wise perturbation robustness and cross-case prompt swaps

If this is right

  • Zero-shot segmentation VLMs should be assessed on the clinical dimensions to which their attention aligns, not solely by overall Dice score.
  • Precise anatomical descriptors in prompts are sufficient to achieve competitive performance without fine-tuning.
  • Patient-specific conditioning can be achieved at inference time by supplying case-matched prompts rather than generic templates.
  • Histology and staging details add little value for spatial localization tasks in this setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt templates for future medical VLMs could be reduced to location-centric descriptions to improve reliability and reduce engineering effort.
  • The same location-priority pattern may appear in other zero-shot segmentation tasks such as brain or liver lesions if anatomical context is similarly dominant.
  • Automated prompt generators that extract and insert precise anatomical phrases from radiology reports could further close the gap to supervised models.

Load-bearing premise

The sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls accurately captures the independent effects of each clinical factor on the model's behavior without interactions or prompt engineering artifacts.

What would settle it

A controlled test that swaps only the anatomical-location phrase between matched patient cases and measures whether Dice falls from ~0.9 to ~0.4 while all other prompt elements remain fixed.
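The matched-versus-mismatched comparison behind such a test reduces to averaging the diagonal and off-diagonal entries of a case-by-prompt Dice matrix. The sketch below assumes entry `dsc[i][j]` holds the Dice score for case `i` segmented with the prompt written for case `j`. The function name and the 3 × 3 toy values are illustrative only and are not taken from the paper.

```python
from statistics import mean


def matched_vs_mismatched(dsc):
    """Return (mean of diagonal, mean of off-diagonal) for a square DSC matrix."""
    n = len(dsc)
    if n < 2 or any(len(row) != n for row in dsc):
        raise ValueError("expected a square matrix with at least two cases")
    matched = [dsc[i][i] for i in range(n)]
    mismatched = [dsc[i][j] for i in range(n) for j in range(n) if i != j]
    return mean(matched), mean(mismatched)


# Toy matrix: rows are cases, columns are prompts (placeholder numbers).
toy = [
    [0.9, 0.0, 0.4],
    [0.2, 0.8, 0.0],
    [0.0, 0.6, 1.0],
]
matched_mean, mismatched_mean = matched_vs_mismatched(toy)
print(matched_mean, mismatched_mean)
```

For the controlled version described above, the column prompts would differ from the row's own prompt only in the location phrase, so any diagonal-to-off-diagonal gap could not be blamed on other attributes.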

Figures

Figures reproduced from arXiv: 2605.01266 by Cosmin Ciausu, Hugo Aerts, Marion Tonneau, Raymond Mak, Suraj Pai, Thibault Heintz.

Figure 1
Figure 1: Prompt perturbation robustness. ∆DSC distribution by perturbation category. Location swaps cause catastrophic degradation (63.4% with |∆DSC| > 0.5); tumor-type and stage swaps are largely benign. … DSC=0.934) but failing for rare types (e.g., “leiomyosarcoma,” DSC=0.000). This gap between generic and full prompts indicates that VoxTell is not simply acting as a prompt-agnostic tumor trigger, yet the variab…
Figure 2
Figure 2: Prompt specificity ladder. DSC shows near-monotonic …
Figure 4
Figure 4: Per-case DSC distributions for all nine models. VoxTell …
Figure 3
Figure 3: Cross-case prompt swap: 5 × 5 DSC matrix. Diagonal (matched) entries substantially outperform off-diagonal (mismatched) entries. Generic “lung tumor” prompt shown in rightmost column. … vs. 0.406 ± 0.441 for mismatched pairs, with 44% of mismatches producing zero output ( …
Figure 5
Figure 5: Pairwise DSC differences relative to VoxTell. Positive …
Original abstract

Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell's spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes "where to look" over "what to look for." In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.
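The abstract reports Benjamini-Hochberg-adjusted p-values. As a rough guide to what that adjustment does, the sketch below implements the standard step-up procedure in plain Python. The function name and the input p-values are placeholders: the paper's raw p-values are not given on this page, and the authors' statistical tooling is not stated.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Placeholder raw p-values, one per pairwise model comparison.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
print(adj)
```

A comparison is called significant at level α when its adjusted value falls below α. Read this way, adjusted p = 0.156 and 0.679 are non-significant at the usual 0.05 level, which only means a difference was not detected; it does not establish that the models are equivalent.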

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript explores the alignment of prompts with clinical factors in the zero-shot segmentation VLM VoxTell for NSCLC tumor segmentation. Using sub-prompt decomposition into categories like diagnosis, demographic, staging, anatomical, generic, and irrelevant controls, along with perturbation robustness tests, specificity ladders, and cross-case swaps, the authors find that anatomical location dominates the model's spatial attention, with 63.4% of location perturbations causing catastrophic DSC drops. They report VoxTell achieving a mean DSC of 0.613 on a held-out dataset, statistically comparable to fine-tuned nnUNet (0.690) and Ahmed et al. (0.675), and superior to other zero-shot models. The work concludes that VLMs should be evaluated on prompt dimension alignment in addition to Dice scores.

Significance. If the results hold, this empirical study contributes by illuminating how zero-shot VLMs prioritize anatomical cues over other clinical details in medical segmentation, with implications for prompt engineering and evaluation protocols. Strengths include the multi-faceted probing approach (decomposition, swaps, specificity ladders), statistical testing with Benjamini-Hochberg correction, and direct benchmarking against both supervised and zero-shot baselines on held-out data, plus the use of cross-case swaps to demonstrate patient-specific conditioning.

major comments (2)
  1. [Methods (Sub-prompt Decomposition)] Methods (Sub-prompt Decomposition and Attribute-wise Perturbation): The claim that anatomical location is the dominant driver (63.4% of location perturbations caused catastrophic drops) rests on the assumption that the decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls isolates independent effects. Semantic entanglements in clinical language (e.g., histology descriptions implicitly constraining plausible locations or stages) are not fully ruled out by the cross-case swaps or irrelevant controls, which could confound attribution of performance changes specifically to location and weaken the 'prioritizes where to look over what to look for' conclusion.
  2. [Results (Benchmarking)] Results (Benchmarking and Statistical Tests): The reported mean DSC of 0.613 for VoxTell being statistically indistinguishable from nnUNet (adjusted p=0.156) and Ahmed et al. (adjusted p=0.679) is central to positioning the zero-shot model as competitive, but the manuscript provides limited detail on test set size, exact definition of 'catastrophic' drops, and perturbation wording controls, making it difficult to assess whether post-hoc choices or dataset biases affect these comparisons.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'catastrophic drops' is used without an explicit quantitative threshold (e.g., DSC below a specific value); define this term in the abstract or early methods for clarity.
  2. [Introduction] Introduction: Include a brief description or citation for the VoxTell model architecture and training to aid readers unfamiliar with the specific VLM variant.
  3. [Figures and Tables] Figures/Tables: Ensure all result tables and figures reporting DSC values include error bars, sample sizes per condition, and clear legends distinguishing perturbation types.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We have addressed each major point below with point-by-point responses. Revisions have been made to improve transparency and acknowledge limitations where the concerns are valid.

Point-by-point responses
  1. Referee: [Methods (Sub-prompt Decomposition)] Methods (Sub-prompt Decomposition and Attribute-wise Perturbation): The claim that anatomical location is the dominant driver (63.4% of location perturbations caused catastrophic drops) rests on the assumption that the decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls isolates independent effects. Semantic entanglements in clinical language (e.g., histology descriptions implicitly constraining plausible locations or stages) are not fully ruled out by the cross-case swaps or irrelevant controls, which could confound attribution of performance changes specifically to location and weaken the 'prioritizes where to look over what to look for' conclusion.

    Authors: We thank the referee for highlighting this potential limitation in our attribution approach. We agree that semantic entanglements in clinical language make complete isolation of effects challenging, and our cross-case swaps and irrelevant controls, while supportive of patient-specific and prompt-sensitive behavior (matched DSC 0.906 vs. mismatched 0.406; irrelevant prompts yielding zero segmentation), do not fully eliminate possible confounds from correlated attributes such as histology and location. In the revised manuscript, we have added an explicit discussion of this issue as a limitation in the Discussion section, including how it tempers the strength of the 'where to look over what to look for' interpretation. We have also expanded the Methods to include additional examples of prompt wording and noted that future work could explore more orthogonal synthetic prompts. revision: partial

  2. Referee: [Results (Benchmarking)] Results (Benchmarking and Statistical Tests): The reported mean DSC of 0.613 for VoxTell being statistically indistinguishable from nnUNet (adjusted p=0.156) and Ahmed et al. (adjusted p=0.679) is central to positioning the zero-shot model as competitive, but the manuscript provides limited detail on test set size, exact definition of 'catastrophic' drops, and perturbation wording controls, making it difficult to assess whether post-hoc choices or dataset biases affect these comparisons.

    Authors: We appreciate the referee's emphasis on methodological transparency for the benchmarking claims. We have revised the manuscript to include the test set size for the held-out internal dataset, the precise definition of 'catastrophic' drops (a DSC reduction of more than 50% relative to the full-prompt baseline), and the standardized wording templates and controls used for all perturbations (now detailed with examples in the Methods and Supplementary Table S1). These additions, along with the already-reported use of Wilcoxon signed-rank tests with Benjamini-Hochberg correction, should enable readers to better evaluate potential biases or post-hoc influences in the comparisons to nnUNet and Ahmed et al. revision: yes
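The definition of "catastrophic" matters for the 63.4% figure. The Figure 1 caption uses an absolute threshold (|∆DSC| > 0.5), while the rebuttal above describes a reduction of more than 50% relative to the full-prompt baseline. These two rules can label the same case differently. The sketch below contrasts them on toy numbers. Both function names and all values are illustrative assumptions, not the authors' code or data.

```python
def catastrophic_fraction_absolute(baseline, perturbed, threshold=0.5):
    """Fraction of cases where |perturbed - baseline| exceeds `threshold`."""
    if not baseline or len(baseline) != len(perturbed):
        raise ValueError("need two non-empty, equal-length sequences")
    flags = [abs(p - b) > threshold for b, p in zip(baseline, perturbed)]
    return sum(flags) / len(flags)


def catastrophic_fraction_relative(baseline, perturbed, fraction=0.5):
    """Fraction of cases where DSC falls by more than `fraction` of baseline."""
    if not baseline or len(baseline) != len(perturbed):
        raise ValueError("need two non-empty, equal-length sequences")
    # Cases with a zero baseline cannot drop further and are never flagged.
    flags = [b > 0 and (b - p) / b > fraction for b, p in zip(baseline, perturbed)]
    return sum(flags) / len(flags)


# Toy per-case DSC values with the full prompt and after a location swap.
full_prompt = [0.9, 0.8, 0.4, 0.6]
after_swap = [0.1, 0.7, 0.1, 0.0]
print(catastrophic_fraction_absolute(full_prompt, after_swap))
print(catastrophic_fraction_relative(full_prompt, after_swap))
```

In the toy data the third case (0.4 to 0.1) is a 75% relative drop but only a 0.3 absolute drop, so it counts under the relative rule and not the absolute one. Cases with low baseline DSC are where the two definitions diverge most.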

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations or fitted predictions

Full rationale

The paper conducts direct experiments on held-out data using sub-prompt decomposition, attribute-wise perturbations, specificity ladders, cross-case swaps, and statistical tests (DSC, Wilcoxon, Benjamini-Hochberg). No equations, parameter fitting, self-citations as load-bearing premises, or renamings of known results appear in the derivation chain. All claims (e.g., 63.4% catastrophic drops on location perturbations, DSC 0.613 vs. baselines) are measured outcomes, not reductions to inputs by construction. The central findings rest on external data and controls rather than self-referential definitions or predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper's claims depend on the assumption that the held-out internal dataset is representative of general NSCLC cases and that prompt perturbations isolate causal factors in the model's decision process without confounding effects.

axioms (2)
  • domain assumption: The held-out internal dataset is representative of general NSCLC cases
    Used for benchmarking the model performance and statistical comparisons.
  • ad hoc to paper: Prompt perturbations can be performed without introducing unintended biases in the VLM input
    Central to the alignment analysis and perturbation robustness tests.

pith-pipeline@v0.9.0 · 5623 in / 1401 out tokens · 55641 ms · 2026-05-09T15:03:13.539652+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 10 canonical work pages

  1. [1]

    Can foundation models really segment tumors? a benchmarking odyssey in lung CT imaging. arXiv preprint arXiv:2505.01239, 2025

    Elena Mulero Ayllón, Massimiliano Mantegna, Linlin Shen, Paolo Soda, Valerio Guarrasi, and Matteo Tortora. Can foundation models really segment tumors? a benchmarking odyssey in lung CT imaging. arXiv preprint arXiv:2505.01239, 2025

  2. [2]

    Assessing interobserver variability in the delineation of structures in radiation oncology: A systematic review. International Journal of Radiation Oncology, Biology, Physics,

    Leslie Guzene, Arnaud Beddok, Christophe Nioche, Romain Modzelewski, Cedric Loiseau, Julia Salleron, and Juliette Thariat. Assessing interobserver variability in the delineation of structures in radiation oncology: A systematic review. International Journal of Radiation Oncology, Biology, Physics,

  3. [3]

    doi: 10.1016/j.ijrobp.2022.10.024

  4. [4]

    He et al

    Yufan He, Pengfei Guo, Yucheng Tang, Andriy Myronenko, Vishwesh Nath, Ziyue Xu, Dong Yang, Can Zhao, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu, and Wenqi Li. VISTA3D: A unified segmentation foundation model for 3D medical imaging. arXiv preprint arXiv:2406.05285, 2024

  5. [5]

    Clinical validation of deep learning algorithms for radiotherapy targeting of non-small-cell lung cancer: an observational study. The Lancet Digital Health, 4(9):e657–e666, 2022

    Ahmed Hosny, Danielle S Bitterman, Christian V Guthier, Jack M Qian, Hannah Roberts, Subha Perni, Anurag Saraf, Luke C Peng, Itai Pashtan, Zezhong Ye, Benjamin H Kann, David E Kozono, David Christiani, Paul J Catalano, Hugo J W L Aerts, and Raymond H Mak. Clinical validation of deep learning algorithms for radiotherapy targeting of non-small-cell lung ca...

  6. [6]

    CAT: Coordinating anatomical-textual prompts for multi-organ and tumor segmentation

    Zhongzhen Huang, Yankai Jiang, Rongzhao Zhang, Shaoting Zhang, and Xiaofan Zhang. CAT: Coordinating anatomical-textual prompts for multi-organ and tumor segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.07085

  7. [7]

    nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021

    Fabian Isensee, Paul F Jaeger, Simon A A Kohl, Jens Petersen, and Klaus H Maier-Hein. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021. doi: 10.1038/s41592-020-01008-z

  8. [8]

    LLM-driven multimodal target volume contouring in radiation oncology

    Yujin Oh, Sangjoon Park, Hwa Kyung Byun, Yeona Cho, Ik Jae Lee, Jin Sung Kim, and Jong Chul Ye. LLM-driven multimodal target volume contouring in radiation oncology. Nature Communications, 15:9186, 2024. doi: 10.1038/s41467-024-53387-y

  9. [9]

    Automated detection and segmentation of non-small cell lung cancer computed tomography images. Nature Communications, 13:3423, 2022

    Sergey Primakov, Avishek Chatterjee, Inigo Bermejo, Arthur Jochems, Janita E van Timmeren, Henry C Woodruff, Abdalla Ibrahim, Martin Vallières, Cary Oberije, Coen Hurkmans, Esther G C Troost, and Philippe Lambin. Automated detection and segmentation of non-small cell lung cancer computed tomography images. Nature Communications, 13:3423, 2022. doi: ...

  10. [10]

    Data13, 10.1038/ s41597-025-06343-4 (2025)

    Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, and Klaus Maier-Hein. VoxTell: Free-text promptable universal 3D medical image segmentation, 2025. URL https://arxiv.org/abs/2511.11450

  11. [11]

    TotalSegmentator: Robust Segmentation of 104 Anatomic Structures in CT Images

    Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Sauter, Tobias Heye, Daniel T Boll, Joshy Cyriac, Shan Yang, Michael Bach, and Martin Segeroth. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence, 5(5):e230024, 2023. doi: 10.1148/ryai.230024

  12. [12]

    2024 , journal =

    Theodore Zhao, Yu Gu, Jianwei Yang, Naoto Usuyama, Ho Hin Lee, Tristan Naumann, Jianfeng Gao, Angela Crabtree, Jacob Abel, Christine Moung-Wen, Brian Piening, Carlo Bifulco, Mu Wei, Hoifung Poon, and Sheng Wang. BiomedParse: a biomedical foundation model for image parsing of everything everywhere all at once. Nature Methods, 2025. doi: 10.1038/s41592-0...

  13. [13]

    One model to rule them all: Towards universal segmentation for medical images with text prompt,

    Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, and Weidi Xie. Large-vocabulary segmentation for medical images with text prompts. arXiv preprint arXiv:2312.17183, 2023