pith. sign in

arxiv: 2504.03476 · v2 · submitted 2025-04-04 · 💻 cs.CV

Anatomy-Aware Text-Visual Fusion with Dual-Perspective Prompts for Fine-Grained Lumbar Spine Segmentation

Pith reviewed 2026-05-22 20:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords lumbar spine segmentationanatomy-aware promptstext-visual fusionmulti-modal learningfine-grained segmentationmedical image analysiscontrastive learningMRI spine
0
0 comments X

The pith

ATM-Net fuses anatomy-aware text prompts with images to improve fine-grained lumbar spine segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that visual-only models fall short for precise lumbar spine segmentation because they miss anatomical semantics, leading to category mix-ups and blurry details. ATM-Net counters this by generating text prompts from annotations in multiple views, then merging them with image features to create richer context for vertebrae, discs, and the spinal canal. A contrastive module sharpens class boundaries at the channel level. If the approach holds, it would deliver higher Dice scores and tighter boundary errors than prior methods on standard MRI datasets.

Core claim

ATM-Net is an anatomy-aware text-guided multi-modal fusion framework that uses the Anatomy-aware Text Prompt Generator to turn image annotations into prompts across views, the Holistic Anatomy-aware Semantic Fusion module to combine them with image features for comprehensive anatomical context, and the Channel-wise Contrastive Anatomy-Aware Enhancement module to boost class discrimination via multi-modal contrastive learning, resulting in finer segmentation of vertebrae, intervertebral discs, and spinal canal.

What carries the argument

The anatomy-aware text-visual fusion mechanism that converts annotations into prompts and integrates them with image features through dedicated fusion and contrastive modules.

If this is right

  • Higher Dice scores and lower boundary errors on datasets like SPIDER and MRSpineSeg.
  • Fewer misclassifications among vertebrae, discs, and spinal canal categories.
  • More accurate capture of fine segmentation details needed for spinal disorder diagnosis.
  • Consistent gains in class discrimination through channel-level contrastive learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The prompt-generation step could be reused on other bony or soft-tissue structures where annotations exist but semantic context is weak.
  • The multi-view prompt strategy might reduce reliance on massive labeled sets by injecting prior anatomical knowledge.
  • Extending the fusion to CT or ultrasound data could test whether the same text-visual pairing improves segmentation in mixed-modality clinics.

Load-bearing premise

The method assumes that turning image annotations into anatomy-aware text prompts and fusing them with visual features will add useful context and sharpen discrimination without introducing offsetting errors or biases.

What would settle it

A direct comparison on a held-out MRI dataset where ATM-Net's Dice score and HD95 do not exceed those of the strongest visual baseline such as SpineParseNet would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2504.03476 by Dengfeng Pan, Fan Zhang, Guang-Yong Chen, Guodong Fan, Hao Xu, Jianlong Cai, Sheng Lian, Shuo Li.

Figure 1
Figure 1. Figure 1: (a) Task definition on the fine-grained segmentation of lumbar spine MRI. (b) Task challenges in various aspects. (c) The design [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Method overview. ATPG adaptively converts image annotation into anatomy-aware text prompts. These insights are integrated with visual features via HASF, building a comprehensive anatomical context. CCAE further enhances class discrimination and segmenta￾tion details through class-wise channel-level multi-modal contrastive learning. Best viewed in color. techniques from CV and NLP communities [25]. This sec… view at source ↗
Figure 3
Figure 3. Figure 3: The process of text prompt generation in ATPG. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The t-SNE visualization of em￾bedding space on both datasets for Swin UNETR and our ATM-Net. Method S L5 L4 L3 L2 L1 T12 T11 T10 T9 L5/S L4/L5 L3/L4 L2/L3 L1/L2 T12/L1 T11/T12 T10/T11 T9/T10 Avg. U-Net 82.31 75.3 60.96 53.87 51.36 53.2 57.21 63.43 40.53 18.3 80 76.97 73.34 67.43 66.98 69.81 64.73 57.3 0.19 58.59 UNETR 80.68 72.14 64.8 64.72 62.08 61.21 65.02 71.54 0 53.69 74.43 71.08 73.36 72.61 72.47 72.5… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparisons between ATM-Net and the comparing methods across two datasets. We also provide zoom-in views with dashed boxes: red concerning class discrimination and green for segmentation details. Best viewed in color. the DSC of 79.39% and the Jaccard of 70.56%, signifi￾cantly surpassing the ones of Swin UNETR by 12.72% and 11.25%, respectively. These results show that integrating clinical text… view at source ↗
Figure 6
Figure 6. Figure 6: Different prompt selections: from Opt.1 to Opt.3, the [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, an innovative framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are further integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements regarding class discrimination and segmentation details. For example, ATM-Net achieves Dice of 79.39% and HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ATM-Net, a multi-modal architecture for fine-grained lumbar spine segmentation (vertebrae, intervertebral discs, spinal canal) that uses the Anatomy-aware Text Prompt Generator (ATPG) to convert image annotations into dual-perspective anatomy-aware prompts, fuses them with visual features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, and refines class discrimination with the Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module. It reports consistent outperformance over prior methods on the MRSpineSeg and SPIDER datasets, including Dice of 79.39% and HD95 of 9.91 pixels on SPIDER (8.31% and 4.14 pixels better than SpineParseNet).

Significance. If the gains are shown to arise from the fusion modules rather than privileged label information, the work would provide evidence that text-visual integration can improve anatomical context and class separation in medical segmentation tasks. The empirical results on two datasets indicate potential clinical utility for more precise spinal disorder diagnosis, though the absence of open code or parameter details limits immediate reproducibility.

major comments (2)
  1. [Abstract] Abstract: the central claim that ATM-Net 'builds a comprehensive anatomical context' and 'enhances class discrimination' via ATPG, HASF, and CCAE rests on the assumption that prompts are generated without ground-truth segmentation masks. The description of ATPG as converting 'image annotations' into prompts leaves open whether these are training-time labels; if ground-truth masks are used, the 8.31% Dice and 4.14-pixel HD95 gains on SPIDER become non-comparable to visual-only baselines such as SpineParseNet and cannot be attributed to the proposed fusion mechanism.
  2. [Methods] Methods (ATPG, HASF, CCAE descriptions): no explicit statement clarifies whether anatomy-aware prompts are available at inference time or only during training, nor whether the contrastive learning in CCAE uses paired text-image features derived from labels. This detail is load-bearing for the claim of 'consistent improvements regarding class discrimination' and must be resolved to evaluate the architecture's contribution.
minor comments (2)
  1. [Abstract] Abstract and Experiments: the manuscript reports aggregate Dice/HD95 but provides no per-class breakdown, statistical significance tests, or ablation isolating ATPG vs. HASF vs. CCAE contributions.
  2. [Experiments] The paper would benefit from a clear statement on training/inference protocol for the text prompts and release of code or model weights to support verification of the reported metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying ambiguities in our description of the prompt generation process. We address each major comment below and will revise the manuscript to improve clarity on training versus inference procedures.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that ATM-Net 'builds a comprehensive anatomical context' and 'enhances class discrimination' via ATPG, HASF, and CCAE rests on the assumption that prompts are generated without ground-truth segmentation masks. The description of ATPG as converting 'image annotations' into prompts leaves open whether these are training-time labels; if ground-truth masks are used, the 8.31% Dice and 4.14-pixel HD95 gains on SPIDER become non-comparable to visual-only baselines such as SpineParseNet and cannot be attributed to the proposed fusion mechanism.

    Authors: The referee correctly notes an ambiguity. The ATPG module converts ground-truth segmentation masks (image annotations) into dual-perspective anatomy-aware prompts during training. This enables the HASF and CCAE modules to learn multi-modal fusion that transfers anatomical context into the visual features. At inference the model operates on visual input alone. We will revise the abstract to explicitly state that prompts are generated from ground-truth annotations exclusively at training time. The reported gains are therefore attributable to the improved visual representations learned via the proposed fusion mechanism, preserving comparability with visual-only baselines evaluated under identical inference conditions. revision: yes

  2. Referee: [Methods] Methods (ATPG, HASF, CCAE descriptions): no explicit statement clarifies whether anatomy-aware prompts are available at inference time or only during training, nor whether the contrastive learning in CCAE uses paired text-image features derived from labels. This detail is load-bearing for the claim of 'consistent improvements regarding class discrimination' and must be resolved to evaluate the architecture's contribution.

    Authors: We agree that the manuscript lacks an explicit statement on this point. Anatomy-aware prompts are generated from ground-truth labels only during training; they are not required at inference. The CCAE module performs class-wise channel-level contrastive learning on paired text-image features derived from labels exclusively during training. We will add a dedicated paragraph in the Methods section (and a corresponding note in the implementation details) that clearly separates the training pipeline (which includes ATPG, HASF, and CCAE) from the inference pipeline (visual input only). This revision will allow readers to evaluate the architecture's contribution without ambiguity. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture with no derivations or self-referential reductions

full rationale

The paper proposes an empirical neural architecture (ATPG, HASF, CCAE modules) for multi-modal segmentation and reports performance gains on standard datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on experimental comparisons rather than any reduction of outputs to inputs by construction. This matches the default expectation of no significant circularity for non-derivational ML method papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, derivations, or detailed methods; therefore no free parameters, axioms, or invented entities can be identified from the available text.

pith-pipeline@v0.9.0 · 5809 in / 1214 out tokens · 53517 ms · 2026-05-22T20:58:46.911285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  1. [1]

    Publicly available clinical bert embeddings

    Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. Publicly available clinical bert embeddings. In Clinical Nat- ural Language Processing Workshop, 2019. 4

  2. [2]

    Deep learning for auto- mated, interpretable classification of lumbar spinal stenosis and facet arthropathy from axial mri

    Upasana Upadhyay Bharadwaj, Miranda Christine, Steven Li, Dean Chou, Valentina Pedoia, Thomas M Link, Cyn- thia T Chin, and Sharmila Majumdar. Deep learning for auto- mated, interpretable classification of lumbar spinal stenosis and facet arthropathy from axial mri. European Radiology,

  3. [3]

    Enhancing medical task performance in gpt-4v: A com- prehensive study on prompt engineering strategies

    Pengcheng Chen, Ziyan Huang, Zhongying Deng, Tianbin Li, Yanzhou Su, Haoyu Wang, Jin Ye, Yu Qiao, and Junjun He. Enhancing medical task performance in gpt-4v: A com- prehensive study on prompt engineering strategies. arXiv,

  4. [4]

    Bi- vlgm: Bi-level class-severity-aware vision-language graph matching for text guided medical image segmentation.IJCV,

    Wenting Chen, Jie Liu, Tianming Liu, and Yixuan Yuan. Bi- vlgm: Bi-level class-severity-aware vision-language graph matching for text guided medical image segmentation.IJCV,

  5. [5]

    A modified bisenet for spinal segmentation

    Yunjiao Deng, Feng Gu, Shuai Wang, Daxing Zeng, Junyan Lu, Haitao Liu, Yulei Hou, and Qinghua Zhang. A modified bisenet for spinal segmentation. In ICIRA, 2023. 2, 5, 6

  6. [6]

    An effective u-net and bisenet complementary network for spine segmentation

    Yunjiao Deng, Feng Gu, Daxing Zeng, Junyan Lu, Haitao Liu, Yulei Hou, and Qinghua Zhang. An effective u-net and bisenet complementary network for spine segmentation. Biomedical Signal Processing and Control, 2024. 2, 5, 6

  7. [7]

    Bert: Pre-training of deep bidirectional trans- formers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. arXiv, 2018. 3

  8. [8]

    En- coder fusion network with co-attention embedding for refer- ring image segmentation

    Guang Feng, Zhiwei Hu, Lihe Zhang, and Huchuan Lu. En- coder fusion network with co-attention embedding for refer- ring image segmentation. In CVPR, 2021. 3

  9. [9]

    Optimizing prompts for text-to-image generation

    Yaru Hao, Zewen Chi, Li Dong, and Furu Wei. Optimizing prompts for text-to-image generation. In NeurIPS, 2024. 3

  10. [10]

    Unetr: Transformers for 3d med- ical image segmentation

    Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d med- ical image segmentation. In WACV, 2022. 6

  11. [11]

    Lsw- net: Lightweight deep neural network based on small-world properties for spine mr image segmentation.Journal of Mag- netic Resonance Imaging, 2023

    Siyuan He, Qi Li, Xianda Li, and Mengchao Zhang. Lsw- net: Lightweight deep neural network based on small-world properties for spine mr image segmentation.Journal of Mag- netic Resonance Imaging, 2023. 2

  12. [12]

    A lightweight convolutional neural network based on dynamic level-set loss function for spine mr image segmentation

    Siyuan He, Qi Li, Xianda Li, and Mengchao Zhang. A lightweight convolutional neural network based on dynamic level-set loss function for spine mr image segmentation. Journal of Magnetic Resonance Imaging, 2024. 2

  13. [13]

    Lga: A language guide adapter for advancing the sam model’s capabilities in medi- cal image segmentation

    Jihong Hu, Yinhao Li, Hao Sun, Yu Song, Chujie Zhang, Lanfen Lin, and Yen-Wei Chen. Lga: A language guide adapter for advancing the sam model’s capabilities in medi- cal image segmentation. In MICCAI, 2024. 3

  14. [14]

    Semi-supervised hybrid spine network for segmentation of spine mr images

    Meiyan Huang, Shuoling Zhou, Xiumei Chen, Haoran Lai, and Qianjin Feng. Semi-supervised hybrid spine network for segmentation of spine mr images. CMIG, 2023. 2

  15. [15]

    Gloria: A multimodal global-local represen- tation learning framework for label-efficient medical image recognition

    Shih-Cheng Huang, Liyue Shen, Matthew P Lungren, and Serena Yeung. Gloria: A multimodal global-local represen- tation learning framework for label-efficient medical image recognition. In ICCV, 2021. 3

  16. [16]

    nnu-net revisited: A call for rigorous validation in 3d medical image segmentation

    Fabian Isensee, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus Maier-Hein, and Paul F Jaeger. nnu-net revisited: A call for rigorous validation in 3d medical image segmentation. In MICCAI, 2024. 6

  17. [17]

    Diagnosis and management of lumbar spinal stenosis: A review

    Jeffrey N Katz, Zoe E Zimmerman, Hanna Mass, and Melvin C Makhni. Diagnosis and management of lumbar spinal stenosis: A review. JAMA, 2022. 2

  18. [18]

    Restr: Convolution-free referring image segmentation using transformers

    Namyup Kim, Dongwon Kim, Cuiling Lan, Wenjun Zeng, and Suha Kwak. Restr: Convolution-free referring image segmentation using transformers. In CVPR, 2022. 3

  19. [19]

    Low back pain

    Nebojsa Nick Knezevic, Kenneth D Candido, Johan WS Vlaeyen, Jan Van Zundert, and Steven P Cohen. Low back pain. The Lancet, 2021. 2

  20. [20]

    Lvit: language meets vision transformer in medical image seg- mentation

    Zihan Li, Yunxiang Li, Qingde Li, Puyang Wang, Dazhou Guo, Le Lu, Dakai Jin, You Zhang, and Qingqi Hong. Lvit: language meets vision transformer in medical image seg- mentation. IEEE TMI, 2023. 3

  21. [21]

    Mlip: Enhancing medical visual representation with divergence encoder and knowledge-guided contrastive learning

    Zhe Li, Laurence T Yang, Bocheng Ren, Xin Nie, Zhangyang Gao, Cheng Tan, and Stan Li. Mlip: Enhancing medical visual representation with divergence encoder and knowledge-guided contrastive learning. In CVPR, 2024. 2

  22. [22]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. In ICCV,

  23. [23]

    Gres: Gen- eralized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Gen- eralized referring expression segmentation. In CVPR, 2023. 3

  24. [24]

    Poly- former: Referring image segmentation as sequential polygon generation

    Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Ku- mar Satzoda, Vijay Mahadevan, and R Manmatha. Poly- former: Referring image segmentation as sequential polygon generation. In CVPR, 2023. 3

  25. [25]

    A visual-language foun- dation model for computational pathology.Nature Medicine,

    Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, et al. A visual-language foun- dation model for computational pathology.Nature Medicine,

  26. [26]

    Collaborative multi-metadata fusion to improve the classifi- cation of lumbar disc herniation

    Shuyi Lu, Jinhua Liu, Xiaojie Wang, and Yuanfeng Zhou. Collaborative multi-metadata fusion to improve the classifi- cation of lumbar disc herniation. IEEE TMI, 2023. 2

  27. [27]

    Image segmentation using text and image prompts

    Timo L ¨uddecke and Alexander Ecker. Image segmentation using text and image prompts. In CVPR, 2022. 3

  28. [28]

    Lumbar intervertebral disc segmentation for computer modeling and simulation

    Rodrigo Matos, Paulo Rui Fernandes, Nuno Matela, and An- dre PG Castro. Lumbar intervertebral disc segmentation for computer modeling and simulation. Computer Methods and Programs in Biomedicine, 2023. 2

  29. [29]

    V-net: Fully convolutional neural networks for volumetric medical image segmentation

    Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016. 5

  30. [30]

    3d mri brain tumor segmentation using autoencoder regularization

    Andriy Myronenko. 3d mri brain tumor segmentation using autoencoder regularization. In MICCAIW, 2019. 6

  31. [31]

    Ecsu-net: an embedded clustering sliced u-net coupled with fusing strategy for efficient intervertebral disc segmentation and classification

    Anam Nazir, Muhammad Nadeem Cheema, Bin Sheng, Ping Li, Huating Li, Guangtao Xue, Jing Qin, Jinman Kim, and David Dagan Feng. Ecsu-net: an embedded clustering sliced u-net coupled with fusing strategy for efficient intervertebral disc segmentation and classification. IEEE TIP, 2021. 2

  32. [32]

    Spineparsenet: Spine parsing for volumetric mr image by a two-stage segmentation framework with se- mantic image representation

    Shumao Pang, Chunlan Pang, Lei Zhao, Yangfan Chen, Zhi- hai Su, Yujia Zhou, Meiyan Huang, Wei Yang, Hai Lu, and Qianjin Feng. Spineparsenet: Spine parsing for volumetric mr image by a two-stage segmentation framework with se- mantic image representation. IEEE TMI, 2021. 5, 6

  33. [33]

    Dgmsnet: Spine segmentation for mr image by a detection- guided mixed-supervised segmentation network

    Shumao Pang, Chunlan Pang, Zhihai Su, Liyan Lin, Lei Zhao, Yangfan Chen, Yujia Zhou, Hai Lu, and Qianjin Feng. Dgmsnet: Spine segmentation for mr image by a detection- guided mixed-supervised segmentation network. MedIA,

  34. [34]

    Per-clip video object segmen- tation

    Kwanyong Park, Sanghyun Woo, Seoung Wug Oh, In So Kweon, and Joon-Young Lee. Per-clip video object segmen- tation. In CVPR, 2022. 3

  35. [35]

    Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision- language pre-training framework

    Vu Minh Hieu Phan, Yutong Xie, Yuankai Qi, Lingqiao Liu, Liyang Liu, Bowen Zhang, Zhibin Liao, Qi Wu, Minh-Son To, and Johan W Verjans. Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision- language pre-training framework. In CVPR, 2024. 2

  36. [36]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, 2021. 3

  37. [37]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 6

  38. [38]

    Automatic semantic segmentation of the lumbar spine: Clinical applicability in a multi-parametric and multi-center study on magnetic resonance images

    Jhon Jairo S ´aenz-Gamboa, Julio Domenech, Antonio Alonso-Manjarr´es, Jon A G ´omez, and Maria de la Iglesia- Vay´a. Automatic semantic segmentation of the lumbar spine: Clinical applicability in a multi-parametric and multi-center study on magnetic resonance images. Artificial Intelligence in Medicine, 2023. 2

  39. [39]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dess `ı, Roberta Raileanu, et al. Toolformer: Language models can teach themselves to use tools. In NeurIPS, 2024. 3

  40. [40]

    Attention gated networks: Learning to leverage salient re- gions in medical images

    Jo Schlemper, Ozan Oktay, Michiel Schaap, Mattias Hein- rich, Bernhard Kainz, Ben Glocker, and Daniel Rueckert. Attention gated networks: Learning to leverage salient re- gions in medical images. MedIA, 2019. 6

  41. [41]

    Test- time prompt tuning for zero-shot generalization in vision- language models

    Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. Test- time prompt tuning for zero-shot generalization in vision- language models. In NeurIPS, 2022. 3

  42. [42]

    Large lan- guage models encode clinical knowledge

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tan- wani, Heather Cole-Lewis, Stephen Pfohl, et al. Large lan- guage models encode clinical knowledge. Nature, 2023. 3

  43. [43]

    Self-supervised pre-training of swin trans- formers for 3d media

    Yucheng Tang, Dong Yang, Wenqi Li, Holger R Roth, Bennett Landman, Daguang Xu, Vishwesh Nath, and Ali Hatamizadeh. Self-supervised pre-training of swin trans- formers for 3d media. In CVPR, 2022. 3, 6

  44. [44]

    Expert-level detection of pathologies from unannotated chest x-ray images via self- supervised learning

    Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, An- drew Y Ng, and Pranav Rajpurkar. Expert-level detection of pathologies from unannotated chest x-ray images via self- supervised learning. Nature Biomedical Engineering, 2022. 3

  45. [45]

    Lumbar spine segmentation in mr images: a dataset and a public benchmark

    Jasper W van der Graaf, Miranda L van Hooff, Constanti- nus FM Buckens, Matthieu Rutten, Job LC van Susante, Robert Jan Kroeze, Marinus de Kleuver, Bram van Gin- neken, and Nikolas Lessmann. Lumbar spine segmentation in mr images: a dataset and a public benchmark. Scientific Data, 2024. 2, 5

  46. [46]

    Automatic vertebra localization and identifica- tion in ct by spine rectification and anatomically-constrained optimization

    Fakai Wang, Kang Zheng, Le Lu, Jing Xiao, Min Wu, and Shun Miao. Automatic vertebra localization and identifica- tion in ct by spine rectification and anatomically-constrained optimization. In CVPR, 2021. 2

  47. [47]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In ICML, 2022. 3

  48. [48]

    Cris: Clip-driven referring image segmentation

    Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. In CVPR, 2022. 3

  49. [49]

    Surgical versus nonsurgical treatment for lumbar degenerative spondylolisthesis

    James N Weinstein, Jon D Lurie, Tor D Tosteson, Brett Hanscom, Anna NA Tosteson, Emily A Blood, Nancy JO Birkmeyer, Alan S Hilibrand, Harry Herkowitz, Frank P Cammisa, et al. Surgical versus nonsurgical treatment for lumbar degenerative spondylolisthesis. NEJM, 2007. 2

  50. [50]

    Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In ICCV,

  51. [51]

    Lavt: Language-aware vi- sion transformer for referring image segmentation

    Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Heng- shuang Zhao, and Philip HS Torr. Lavt: Language-aware vi- sion transformer for referring image segmentation. InCVPR,

  52. [52]

    Madapter: A better interaction between image and language for medical image segmentation

    Xu Zhang, Bo Ni, Yang Yang, and Lefei Zhang. Madapter: A better interaction between image and language for medical image segmentation. In MICCAI, 2024. 3

  53. [53]

    Spinemamba: Enhancing 3d spinal segmentation in clinical imaging through residual vi- sual mamba layers and shape priors

    Zhiqing Zhang, Tianyong Liu, Guojia Fan, Bin Li, Qian- jin Feng, and Shoujun Zhou. Spinemamba: Enhancing 3d spinal segmentation in clinical imaging through residual vi- sual mamba layers and shape priors. arXiv, 2024. 2

  54. [54]

    Deep learning-based high-accuracy quantitation for lumbar intervertebral disc de- generation from mri

    Hua-Dong Zheng, Yue-Li Sun, De-Wei Kong, Meng-Chen Yin, Jiang Chen, Yong-Peng Lin, Xue-Feng Ma, Hong-Shen Wang, Guang-Jie Yuan, Min Yao, et al. Deep learning-based high-accuracy quantitation for lumbar intervertebral disc de- generation from mri. Nature Communications, 2022. 2

  55. [55]

    Text promptable surgical in- strument segmentation with vision-language models

    Zijian Zhou, Oluwatosin Alabi, Meng Wei, Tom Ver- cauteren, and Miaojing Shi. Text promptable surgical in- strument segmentation with vision-language models. In NeurIPS, 2023. 3