
arxiv: 2605.15997 · v1 · pith:PPB33SUZnew · submitted 2026-05-15 · 💻 cs.CV

Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

Pith reviewed 2026-05-20 18:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords CT image analysis · vision-language models · medical image segmentation · object detection · explainable AI · autoregressive framework · multimodal dataset

The pith

A unified autoregressive model performs CT segmentation, detection, and textual reasoning together by routing tasks through vision-language tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that segmentation, detection, and explanation of CT appearances can be handled inside one autoregressive framework instead of separate models. Standard deep-learning approaches stay at image-level pattern matching and rarely supply explicit anatomical or contextual reasoning, yet clinical work often needs several kinds of output at once. The method adds task-routing tokens that read the hidden states of a large vision-language model and activate the right heads for masks, boxes, or text, then refines results with a closer-look mechanism that narrows the field of view step by step. A new multimodal dataset supplies the pixel masks, bounding boxes, prompts, and descriptions needed to train and test this joint behaviour. If the approach holds, a single pass through the model yields both accurate visual results and readable explanations without switching tools.

Core claim

The authors present a unified autoregressive framework that inserts task-routing tokens conditioned on the hidden states of a large vision-language model; these tokens activate and integrate detection and segmentation heads so that masks, bounding boxes, and structured textual reasonings are generated coherently. A closer-look mechanism then lets the model revisit regions of interest at progressively finer fields of view. Training and evaluation rest on a newly curated multimodal CT dataset that pairs pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions, all produced by AI-assisted annotation followed by human verification.
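The closer-look idea can be pictured as a loop that locates a region, crops a tighter field of view around it, and locates again. The following is a minimal sketch under stated assumptions, not the authors' implementation: `closer_look`, `expand`, the `locate` callback, the margin, and the step count are all hypothetical names and values chosen for illustration, and a toy brightness threshold stands in for the model.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive
Image = List[List[float]]

def expand(box: Box, margin: int, height: int, width: int) -> Box:
    """Pad a box by `margin` pixels, clipped to the image bounds."""
    r0, c0, r1, c1 = box
    return (max(0, r0 - margin), max(0, c0 - margin),
            min(height, r1 + margin), min(width, c1 + margin))

def closer_look(image: Image, locate: Callable[[Image], Box],
                steps: int = 3, margin: int = 1) -> Box:
    """Coarse-to-fine loop: locate, crop around the estimate, locate again."""
    height, width = len(image), len(image[0])
    fov: Box = (0, 0, height, width)  # first pass sees the full image
    estimate = fov
    for _ in range(steps):
        r0, c0, r1, c1 = fov
        crop = [row[c0:c1] for row in image[r0:r1]]
        lr0, lc0, lr1, lc1 = locate(crop)  # box in crop coordinates
        # Map the crop-relative box back to full-image coordinates.
        estimate = (r0 + lr0, c0 + lc0, r0 + lr1, c0 + lc1)
        fov = expand(estimate, margin, height, width)
    return estimate

def bright_box(crop: Image) -> Box:
    """Toy stand-in for the model: tight box around pixels above 0.5.
    Assumes at least one such pixel exists in the crop."""
    rows = [i for i, row in enumerate(crop) if any(v > 0.5 for v in row)]
    cols = [j for j in range(len(crop[0])) if any(row[j] > 0.5 for row in crop)]
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)

# Synthetic 8x8 image with a 2x2 bright square.
image = [[0.0] * 8 for _ in range(8)]
for r in range(3, 5):
    for c in range(4, 6):
        image[r][c] = 1.0

result = closer_look(image, bright_box)
```

With a perfect toy locator the estimate does not change between passes; the sketch only shows the coordinate bookkeeping that any progressive field-of-view scheme needs.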

What carries the argument

Task-routing tokens conditioned on vision-language hidden states that trigger and combine detection and segmentation heads, together with a closer-look mechanism that performs progressive coarse-to-fine refinement of regions of interest.
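One way to read the routing idea is as a dispatch step over the generated sequence: ordinary tokens become text, while special tokens hand their hidden state to a task head. The sketch below is illustrative only. The token names `<SEG>` and `<DET>`, the `route` function, and the placeholder heads are assumptions made here; the paper's actual vocabulary and head designs are not specified in this summary.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical routing-token names, chosen for this example.
SEG_TOKEN = "<SEG>"
DET_TOKEN = "<DET>"

def seg_head(hidden: Sequence[float]) -> str:
    """Placeholder for a mask decoder conditioned on one hidden state."""
    return f"mask from {len(hidden)}-d hidden state"

def det_head(hidden: Sequence[float]) -> str:
    """Placeholder for a box regressor conditioned on one hidden state."""
    return f"box from {len(hidden)}-d hidden state"

HEADS: Dict[str, Callable[[Sequence[float]], str]] = {
    SEG_TOKEN: seg_head,
    DET_TOKEN: det_head,
}

def route(tokens: List[str],
          hidden_states: List[Sequence[float]]) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a generated sequence into plain text and head outputs.

    Each routing token triggers the matching head on the hidden state
    at that position; every other token is kept as text.
    """
    text: List[str] = []
    visual: List[Tuple[str, str]] = []
    for tok, hidden in zip(tokens, hidden_states):
        head = HEADS.get(tok)
        if head is None:
            text.append(tok)
        else:
            visual.append((tok, head(hidden)))
    return " ".join(text), visual

tokens = ["The", "liver", DET_TOKEN, SEG_TOKEN, "appears", "enlarged"]
hidden_states = [[0.0] * 4 for _ in tokens]  # dummy 4-d states
text, visual = route(tokens, hidden_states)
```

The referee's interference concern maps onto this picture directly: both heads read from the same stream of hidden states, so anything that degrades those states affects text, boxes, and masks together.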

If this is right

  • Segmentation Dice improves by up to 1.0 percent on BTCV and 1.7 percent on MosMed+ while detection and reasoning outputs are produced in the same forward pass.
  • The closer-look mechanism supplies successive refinements that raise both localization accuracy and semantic clarity.
  • The curated multimodal dataset supplies consistent training signals for masks, boxes, prompts, and textual descriptions.
  • Appearance reasoning is generated alongside visual outputs without requiring a second model.
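The segmentation gains above are stated in terms of the Dice coefficient, which for binary masks A and B is 2|A ∩ B| / (|A| + |B|). A small sketch of that standard definition follows; the function name `dice` and the convention of returning 1.0 for two empty masks are choices made for this example, and the paper's exact evaluation protocol may differ.

```python
from typing import List

Mask = List[List[int]]  # binary mask, entries are 0 or 1

def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    intersection = sum(p * t
                       for pred_row, target_row in zip(pred, target)
                       for p, t in zip(pred_row, target_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / total

pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = dice(pred, target)  # 2 shared pixels, 3 + 3 foreground pixels
```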

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing and refinement design could be tested on other volumetric modalities such as MRI to check whether the gains transfer.
  • Clinical pipelines might use the textual outputs directly for report drafting or second-reader alerts.
  • Further chaining of reasoning steps could let the model answer free-form questions about a scan once the initial outputs are available.

Load-bearing premise

The assumption that task-routing tokens conditioned on the hidden states of the large vision-language model can coherently trigger and integrate detection and segmentation heads without task interference or loss of localization accuracy.

What would settle it

A controlled comparison showing whether joint multi-task performance on BTCV or MosMed+ drops below that of separate single-task models, or whether the generated textual reasonings systematically disagree with the produced masks and boxes on the same cases.
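Part of that check, agreement between a produced mask and a produced box for the same object, is easy to automate. The sketch below derives the tight box of a mask and compares it with a predicted box by intersection over union. The helper names, the box convention, and the idea of flagging low-IoU cases are assumptions made for illustration; checking whether the generated text agrees with the visual outputs would need a separate procedure not shown here.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive
Mask = List[List[int]]

def mask_to_box(mask: Mask) -> Optional[Box]:
    """Tight bounding box of the foreground pixels, or None if empty."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)

def box_area(box: Box) -> int:
    return (box[2] - box[0]) * (box[3] - box[1])

def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_h = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

# Synthetic case: a 2x2 mask and a predicted box one column too wide.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
predicted_box: Box = (1, 1, 3, 4)

derived_box = mask_to_box(mask)
agreement = box_iou(derived_box, predicted_box)
```

Aggregating this agreement score over a test set, and comparing it between joint and single-task models, would give one concrete measure of the interference the premise rules out.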

Original abstract

Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches typically focus on a single task, which is insufficient for clinical workflow analysis that requires multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasonings. To progressively enhance localisation accuracy and semantic clarity, we further design a "closer-look" mechanism that allows the model to perform progressive coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, achieving up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified autoregressive framework for CT image analysis that integrates segmentation, detection, and textual explanation. Task-routing tokens conditioned on the hidden states of a large vision-language model trigger specialized detection and segmentation heads, while a closer-look mechanism enables progressive coarse-to-fine refinement of regions of interest. A new multimodal CT dataset with pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions is curated via AI-assisted annotation with human verification. Experiments on public benchmarks report consistent improvements over state-of-the-art, with gains of up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, alongside appearance reasoning outputs.

Significance. If validated, the work could advance clinical CT interpretation by providing a single model for multi-task visual analysis plus linguistic reasoning, addressing the gap between pure pattern recognition and contextual understanding. The release of code and dataset supports reproducibility and community use. The modest scale of reported gains, however, requires clear attribution to the proposed components rather than dataset curation or model scale alone.

major comments (2)
  1. The central claim that task-routing tokens conditioned on VLM hidden states enable coherent multi-task outputs without interference rests on an untested assumption. No ablation isolates the routing tokens from standard multi-task training or the new dataset; shared hidden states during autoregressive generation and progressive FOV refinement could induce crosstalk that dilutes localization accuracy, especially given the modest reported Dice gains.
  2. Experiments section: the reported improvements (up to 1.0% Dice on BTCV, 1.7% on MosMed+) lack error bars, statistical significance tests, or training-procedure details. Without these, it is impossible to determine whether gains exceed variance or stem from the closer-look mechanism versus baseline factors.
minor comments (2)
  1. Abstract: the phrase 'consistent improvements over the SoTA' should name the specific baseline methods and exact metric values for each benchmark to allow immediate comparison.
  2. Dataset description: clarify the exact protocol for human verification of AI-assisted annotations to ensure reproducibility and rule out systematic biases in the new multimodal CT dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and constructive feedback on our unified autoregressive framework for CT image analysis. We address each major comment below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: The central claim that task-routing tokens conditioned on VLM hidden states enable coherent multi-task outputs without interference rests on an untested assumption. No ablation isolates the routing tokens from standard multi-task training or the new dataset; shared hidden states during autoregressive generation and progressive FOV refinement could induce crosstalk that dilutes localization accuracy, especially given the modest reported Dice gains.

    Authors: While we agree that further ablations would provide stronger evidence for the specific contribution of the task-routing tokens, our current experiments demonstrate consistent improvements over baselines that do not employ the proposed routing mechanism or the closer-look refinement. The design of conditioning the routing tokens on VLM hidden states and using progressive refinement is intended to reduce potential crosstalk by allowing task-specific specialization and focused attention on regions of interest. Qualitative examples in the paper illustrate coherent multi-task outputs without evident interference. To address this concern directly, we will include an ablation study isolating the routing tokens in the revised manuscript. revision: yes

  2. Referee: Experiments section: the reported improvements (up to 1.0% Dice on BTCV, 1.7% on MosMed+) lack error bars, statistical significance tests, or training-procedure details. Without these, it is impossible to determine whether gains exceed variance or stem from the closer-look mechanism versus baseline factors.

    Authors: We acknowledge the importance of statistical rigor in reporting experimental results. In the revised version of the manuscript, we will add error bars representing standard deviations from multiple training runs, perform and report statistical significance tests (e.g., paired t-tests), and provide comprehensive details on the training procedure, hyperparameters, and computational setup. This will allow readers to better assess the reliability and source of the observed improvements. revision: yes
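The paired test mentioned in the second response can be sketched with the standard library alone. The per-case Dice values below are made-up placeholders, not results from the paper, and `paired_t` is a hypothetical helper; it returns the t statistic and degrees of freedom, leaving the p-value lookup to a statistics package.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

def paired_t(proposed: Sequence[float],
             baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched scores."""
    if len(proposed) != len(baseline) or len(proposed) < 2:
        raise ValueError("need two equally long samples with at least 2 cases")
    diffs = [p - b for p, b in zip(proposed, baseline)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1

# Placeholder per-case Dice scores for illustration only.
baseline_dice = [0.80, 0.82, 0.78, 0.85, 0.81]
proposed_dice = [0.82, 0.83, 0.80, 0.85, 0.83]

t_stat, dof = paired_t(proposed_dice, baseline_dice)
```

Pairing matters here because both models are scored on the same scans: the test works on per-case differences, so case difficulty that affects both models equally cancels out.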

Circularity Check

0 steps flagged

No circularity: new framework and external benchmarks

Full rationale

The paper introduces a novel autoregressive architecture with task-routing tokens conditioned on VLM hidden states plus a closer-look progressive refinement mechanism. These are presented as original constructions, trained on a newly curated multimodal CT dataset, and evaluated via Dice improvements on independent public benchmarks (BTCV, MosMed+). No equations, predictions, or uniqueness claims reduce by construction to fitted inputs or self-citations; the derivation chain remains self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework relies on standard vision-language model components and a new curated dataset whose construction details are not specified.

pith-pipeline@v0.9.0 · 5798 in / 1073 out tokens · 35292 ms · 2026-05-20T18:14:48.571583+00:00 · methodology

