pith. machine review for the scientific record.

arxiv: 2102.04306 · v1 · submitted 2021-02-08 · 💻 cs.CV

Recognition: 2 theorem links

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Pith reviewed 2026-05-11 13:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image segmentation · transformer encoder · U-Net · hybrid architecture · global context · skip connections · multi-organ segmentation

The pith

TransUNet shows that a transformer encoder on CNN features can supply the global context U-Net needs for accurate medical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to combine the long-range modeling strength of transformers with the localization precision of U-Net, arguing that pure convolutional networks miss distant dependencies while pure transformers lack fine spatial detail. It does this by feeding tokenized patches from a CNN feature map into a transformer encoder, then routing the encoded output through a decoder that merges it with high-resolution CNN skip connections. A reader would care because medical segmentation underpins diagnosis and treatment planning, and the hybrid design is presented as a direct way to improve boundary accuracy and overall performance on tasks such as multi-organ and cardiac imaging.

Core claim

The central claim is that transformers can serve as strong encoders for medical image segmentation: tokenized image patches extracted from a CNN feature map are processed by a transformer to capture global contexts, after which a U-Net-style decoder upsamples these features and fuses them with the original high-resolution CNN feature maps to recover precise localization, yielding better results than competing methods on multi-organ and cardiac segmentation benchmarks.

What carries the argument

A hybrid encoder-decoder carries the argument: a transformer processes tokenized patches from a CNN feature map to extract global context through self-attention, and the decoder then merges that output with high-resolution CNN features via skip connections to recover localization.
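The data flow described here can be traced on a toy example. The sketch below is not the authors' implementation: the transformer and the upsampling step are left out, all function names are hypothetical, one token per spatial location (patch size 1) is assumed, and fusing by channel concatenation is an assumption borrowed from common U-Net practice. It only shows how a C x H x W feature map becomes a sequence of H*W tokens, how that sequence is folded back into a 2-D map, and how a skip feature can be attached.

```python
# Toy shape bookkeeping only; sizes and names are illustrative assumptions.


def tokenize(feature_map):
    """Flatten a C x H x W nested list into H*W tokens of length C (row-major)."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def untokenize(tokens, height, width):
    """Inverse of tokenize: rebuild a C x H x W map from H*W tokens."""
    channels = len(tokens[0])
    return [
        [[tokens[y * width + x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


def fuse_with_skip(decoded, skip):
    """Concatenate two maps along the channel axis; H and W must agree."""
    same_height = len(decoded[0]) == len(skip[0])
    same_width = len(decoded[0][0]) == len(skip[0][0])
    if not (same_height and same_width):
        raise ValueError("spatial sizes differ; upsample before fusing")
    return decoded + skip


# A 2-channel, 2 x 3 feature map standing in for the CNN output.
cnn_features = [
    [[1, 2, 3], [4, 5, 6]],
    [[10, 20, 30], [40, 50, 60]],
]
tokens = tokenize(cnn_features)            # 6 tokens, each of length 2
# A real model would pass `tokens` through the transformer encoder here.
restored = untokenize(tokens, height=2, width=3)
skip = [[[0, 0, 0], [0, 0, 0]]]            # 1-channel stand-in for a skip feature
fused = fuse_with_skip(restored, skip)     # 3 channels, still 2 x 3
```

Because the round trip is lossless, any change in the restored map in a real model comes from the transformer itself, not from the reshaping.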

If this is right

  • Medical segmentation pipelines can replace pure CNN encoders with transformer encoders while keeping the decoder and skip connections intact.
  • Long-range dependencies become explicitly modeled without sacrificing the fine boundary detail that CNN features provide.
  • Performance gains appear across both multi-organ and cardiac tasks when the same hybrid fusion is applied.
  • The architecture offers a template for other encoder-decoder segmentation models that need both global and local cues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar tokenization-plus-fusion steps could be tested on non-medical imaging domains that already use U-Net variants.
  • The approach may reduce the amount of local annotation needed if global context helps the model generalize from fewer examples.
  • Future work could measure whether the transformer encoder alone, without the CNN backbone, suffices on larger training sets.

Load-bearing premise

That adding global context from the transformer to the CNN skip connections will reliably raise localization accuracy on new clinical data without introducing fresh failure modes.

What would settle it

A head-to-head evaluation on a held-out multi-center medical dataset in which the hybrid model shows no gain or a drop in Dice or Hausdorff metrics relative to a standard U-Net baseline.
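For readers unfamiliar with the two metrics named here, the following is a minimal sketch of how they can be computed on binary masks. It uses the plain maximum Hausdorff distance in pixel units; percentile variants and physical voxel spacing are not handled, and the masks are made-up toy inputs rather than data from the paper.

```python
import math


def dice(pred, target):
    """Dice coefficient between two binary masks given as nested 0/1 lists."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def foreground_points(mask):
    """Coordinates (row, column) of all non-zero pixels."""
    return [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def hausdorff(pred, target):
    """Symmetric Hausdorff distance between the two foreground point sets."""
    a, b = foreground_points(pred), foreground_points(target)
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(s, d) for d in dst) for s in src)

    return max(directed(a, b), directed(b, a))


# Toy masks: the target has one extra foreground pixel on the right.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
target = [
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
dice_score = dice(pred, target)           # overlap score in [0, 1]
hausdorff_px = hausdorff(pred, target)    # worst-case boundary error in pixels
```

Dice rewards overlap and is higher when better, while the Hausdorff distance penalizes the single worst boundary error and is lower when better, so a model can improve on one and regress on the other.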

Original abstract

Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TransUNet, a hybrid CNN-Transformer architecture for medical image segmentation. A CNN backbone extracts feature maps that are tokenized and processed by a Transformer encoder to capture global context; these are then decoded via a U-Net-style upsampling path that fuses the Transformer output with high-resolution CNN skip connections to recover fine localization details. The central claim is that this design yields superior empirical performance over competing methods on multi-organ and cardiac segmentation tasks.

Significance. If the performance gains are robust, the work is significant for demonstrating that Transformers can function as strong encoders in medical imaging by explicitly modeling long-range dependencies while the U-Net decoder mitigates localization weaknesses. The public release of code and models is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. [§4, Experiments] The superiority claims rest on comparisons to U-Net and other baselines, but no ablation is reported against a pure CNN encoder (e.g., deeper/wider ResNet or U-Net) with matched parameter count and training protocol. Without this, it is impossible to determine whether gains derive from the Transformer’s global attention or simply from increased capacity, which is load-bearing for the central architectural claim.
  2. [§4.1 and results tables] Performance improvements are presented without statistical significance tests, confidence intervals, or multiple random-seed runs. Given the small size and high variability of medical datasets, modest metric gains may not be reliable, directly affecting the strength of the “superior performance” assertion.
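The capacity-matching concern in the first major comment comes down to simple parameter arithmetic. The sketch below counts parameters for one transformer encoder layer and one convolution layer; the layer sizes are illustrative assumptions, not values taken from the text reviewed here, and the helper names are hypothetical.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights plus optional bias of a standard 2-D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def transformer_layer_params(hidden_dim, mlp_dim):
    """One encoder layer: self-attention, two-layer MLP, and two LayerNorms."""
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (hidden_dim * hidden_dim + hidden_dim)
    # Expansion and contraction linear layers, each with a bias vector.
    mlp = (hidden_dim * mlp_dim + mlp_dim) + (mlp_dim * hidden_dim + hidden_dim)
    # Each LayerNorm has a scale and a shift vector.
    layer_norms = 2 * (2 * hidden_dim)
    return attention + mlp + layer_norms


# Illustrative sizes only (assumptions for this sketch).
per_layer = transformer_layer_params(hidden_dim=768, mlp_dim=3072)
encoder_total = 12 * per_layer
conv_layer = conv2d_params(512, 512, kernel_size=3)

# Roughly how many such conv layers a CNN-only encoder would need to add
# in order to match the transformer stack's parameter count.
matched_conv_layers = round(encoder_total / conv_layer)
```

Under these assumed sizes the transformer stack is worth a few dozen wide 3x3 convolution layers, which is why an ablation against a deeper or wider CNN encoder is a meaningful control and not a formality.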
minor comments (2)
  1. [§3.2] The precise reshaping of the Transformer output sequence back into a 2-D feature map before decoder fusion is described at a high level; adding an equation or explicit tensor-shape diagram would improve clarity and reproducibility.
  2. [Figure 2] The architecture diagram would benefit from explicit annotation of the tokenization step and the exact skip-connection fusion points to match the textual description.
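One possible form for the shape statement requested in the first minor comment is given below. The symbols and the row-major token ordering are assumptions made for illustration; they are not the paper's notation.

```latex
% Assumed symbols: F is the CNN feature map, P the patch size on that map,
% D the transformer hidden size, L the number of transformer layers.
\[
\begin{aligned}
\mathbf{F} &\in \mathbb{R}^{C \times H' \times W'},
  \qquad N = \frac{H'}{P} \cdot \frac{W'}{P}, \\
\mathbf{z}_L &\in \mathbb{R}^{N \times D}
  \quad \text{(transformer output sequence)}, \\
\mathbf{Z}[:, i, j] &= \mathbf{z}_L\!\left[\, i \cdot \tfrac{W'}{P} + j \,\right],
  \qquad 0 \le i < \tfrac{H'}{P},\; 0 \le j < \tfrac{W'}{P}, \\
\mathbf{Z} &\in \mathbb{R}^{D \times \frac{H'}{P} \times \frac{W'}{P}}
  \quad \text{(2-D map passed to the decoder)}.
\end{aligned}
\]
```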

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of our work. We address the major comments point-by-point below.

Point-by-point responses
  1. Referee: [§4, Experiments] The superiority claims rest on comparisons to U-Net and other baselines, but no ablation is reported against a pure CNN encoder (e.g., deeper/wider ResNet or U-Net) with matched parameter count and training protocol. Without this, it is impossible to determine whether gains derive from the Transformer’s global attention or simply from increased capacity, which is load-bearing for the central architectural claim.

    Authors: We appreciate the referee's point regarding the need for capacity-matched ablations to isolate the benefit of the Transformer encoder. Our manuscript compares TransUNet to the standard U-Net and other established baselines under consistent training protocols. The improvements are observed across different medical segmentation tasks, supporting that the global context modeling from the Transformer contributes to better performance. To directly address this, we will include parameter counts for all models in the revised manuscript and add an ablation study using a deeper CNN backbone with comparable parameters where possible.

    Revision: partial

  2. Referee: [§4.1 and results tables] Performance improvements are presented without statistical significance tests, confidence intervals, or multiple random-seed runs. Given the small size and high variability of medical datasets, modest metric gains may not be reliable, directly affecting the strength of the “superior performance” assertion.

    Authors: We concur that providing statistical analysis would enhance the reliability of our results, particularly given the characteristics of medical imaging datasets. We will update the experimental section to include multiple runs with different random seeds, report standard deviations or confidence intervals, and conduct appropriate statistical tests to validate the significance of the performance gains in the revised manuscript.

    Revision: yes
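The seed-level reporting described in the second response could look like the sketch below. The scores are made-up placeholders, not results from the paper; pairing runs by seed assumes both models share data splits for each seed; and the exact sign-flip permutation test is one reasonable choice among several, not a method the authors name.

```python
import itertools
import math
import statistics


def mean_and_half_width(values, critical=1.96):
    """Mean and confidence-interval half-width: critical * s / sqrt(n)."""
    mean = statistics.mean(values)
    half_width = critical * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # tolerance for float rounding
            extreme += 1
    return extreme / total


# Placeholder Dice scores for five seeds (hypothetical values).
hybrid = [0.78, 0.79, 0.77, 0.80, 0.78]
baseline = [0.76, 0.78, 0.75, 0.77, 0.77]

# 2.776 is the two-sided 95% Student-t critical value for 4 degrees of freedom.
hybrid_mean, hybrid_hw = mean_and_half_width(hybrid, critical=2.776)
baseline_mean, baseline_hw = mean_and_half_width(baseline, critical=2.776)
p_value = paired_sign_flip_p(hybrid, baseline)
```

Even when one model wins on every one of five seeds, this test cannot return a p-value below 2/32 = 0.0625, so a handful of seeds is not enough to claim significance at the 0.05 level with it; more runs or per-case paired tests would be needed.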

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with benchmark validation

The paper introduces TransUNet as a hybrid CNN-Transformer encoder-decoder for medical segmentation and supports its claims solely through empirical results on public benchmarks (multi-organ and cardiac segmentation). No equations, predictions, or first-principles derivations appear in the provided text; the architecture is defined explicitly and then evaluated, without any fitted parameter being relabeled as a prediction or any load-bearing premise reducing to a self-citation. The superiority statements are presented as experimental outcomes rather than closed-loop theoretical necessities, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical superiority of a new neural architecture; no additional free parameters, axioms, or invented entities are introduced beyond standard deep-learning components.

pith-pipeline@v0.9.0 · 5566 in / 957 out tokens · 41374 ms · 2026-05-11T13:30:41.413755+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Weakly Supervised Segmentation as Semantic-Based Regularization

    cs.CV 2026-05 unverdicted novelty 7.0

    Differentiable fuzzy logic constraints fine-tune SAM to generate higher-quality pseudo-labels, enabling a second-stage model to reach state-of-the-art weakly supervised segmentation on Pascal VOC and REFUGE2, sometime...

  2. RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

    cs.CV 2026-05 unverdicted novelty 7.0

    RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.

  3. Camyla: Scaling Autonomous Research in Medical Image Segmentation

    cs.AI 2026-04 unverdicted novelty 7.0

    Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.

  4. Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening

    cs.CV 2026-04 unverdicted novelty 7.0

    NPS-Net formulates optic disc and cup segmentation as nested radially monotone polar occupancy estimation to guarantee star-convexity, nesting, and high accuracy for glaucoma screening.

  5. XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation

    cs.CV 2026-03 unverdicted novelty 7.0

    XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines eve...

  6. Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    A hybrid CNN-transformer model with multi-task learning achieves 91.3% WBC classification accuracy and 0.72 Pearson correlation for CD16 expression regression from label-free DPC images, augmented by LLM-generated summaries.

  7. Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

    cs.CV 2026-05 unverdicted novelty 6.0

    A three-stage pipeline detects 16 landmarks, coarsely segments 12 labels, and refines them into 26 structures using landmark constraints to improve accuracy in subcortical MRI segmentation.

  8. DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.

  9. ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    ESICA delivers state-of-the-art accuracy on a five-modality 3D medical segmentation benchmark while offering a compact variant with far fewer parameters.

  10. MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention

    cs.CV 2026-04 unverdicted novelty 6.0

    MedFlowSeg is a conditional flow matching model for medical image segmentation that adds dual-branch spatial attention and frequency-aware attention to achieve more efficient inference than diffusion models while impr...

  11. Rethinking Uncertainty in Segmentation: From Estimation to Decision

    cs.CV 2026-04 unverdicted novelty 6.0

    Uncertainty optimization alone misses most safety gains; a decision-stage deferral policy removes up to 80% segmentation errors at 25% pixel deferral with cross-dataset robustness, while calibration does not improve d...

  12. Human Gaze-based Dual Teacher Guidance Learning for Semi-Supervised Medical Image Segmentation

    eess.IV 2026-04 unverdicted novelty 6.0

    HG-DTGL integrates human gaze as an extra teacher in mean-teacher learning via GazeMix, MGP module and Gaze Loss, reporting superior segmentation across ten organs on multiple modalities.

  13. A fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation

    physics.med-ph 2026-04 unverdicted novelty 6.0

    A hybrid Transformer-UNet model with energy-shifting inputs generates 6 MV LINAC dose maps from monoenergetic data, achieving over 98% gamma passing rate (3%/3mm) versus full Monte Carlo for prostate radiotherapy.

  14. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  15. A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images

    cs.CV 2026-05 unverdicted novelty 5.0

    A TransUNet-based segmentation followed by texture comparison classifies fatty pancreas in ultrasound with 89.7% accuracy on a small clinical dataset.

  16. Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

    cs.CV 2026-05 unverdicted novelty 5.0

    A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.

  17. Lightning Unified Video Editing via In-Context Sparse Attention

    cs.CV 2026-05 unverdicted novelty 5.0

    ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

  18. Domain-Adaptive Arrhythmia Classification Using a Hybrid Transformer on Wearable Heart Signals

    eess.SP 2026-05 unverdicted novelty 5.0

    A hybrid transformer combining ECG morphology and HRV features with MMD domain adaptation achieves 95% F1-macro on unseen wearable data for arrhythmia classification.

  19. Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions

    eess.SP 2026-05 conditional novelty 5.0

    Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...

  20. SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT

    cs.CV 2026-05 unverdicted novelty 5.0

    SAIL integrates anatomical priors at the representation level with semantic features via fusion to produce more anatomically aligned attribution maps in OCT without altering existing explainability techniques.

  21. CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model

    cs.CV 2026-04 unverdicted novelty 5.0

    Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.

  22. MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.

  23. Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images

    cs.CV 2026-04 unverdicted novelty 5.0

    A semi-supervised framework using weighted knowledge distillation and SinusCycle-GAN refinement achieves 96.35% Dice score for maxillary sinus segmentation in panoramic X-rays from 2,511 patients.

  24. EDU-Net: Retinal Pathological Fluid Segmentation in OCT Images with Multiscale Feature Fusion and Boundary Optimization

    eess.IV 2026-04 unverdicted novelty 5.0

    EDU-Net fuses multiscale local and global features with boundary optimization to achieve state-of-the-art segmentation of intraretinal and subretinal fluid in OCT images.

  25. RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.

  26. UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation

    cs.CV 2026-04 unverdicted novelty 5.0

    UniSemAlign aligns text and prototype representations with visual features to generate better supervision signals for semi-supervised segmentation, reporting Dice gains of up to 8.6% on CRAG with 10% labels.

  27. Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

    cs.CV 2026-04 unverdicted novelty 5.0

    A ViT-LSTM spatiotemporal model detects surgical instrument handovers and classifies direction in videos, achieving F1 of 0.84 for detection and 0.72 mean F1 for direction on kidney transplant data.

  28. CardioSAM: Topology-Aware Decoder Design for High-Precision Cardiac MRI Segmentation

    cs.CV 2026-03 accept novelty 5.0

    CardioSAM introduces a topology-aware decoder and boundary refinement module to elevate SAM's performance on cardiac MRI, achieving a 93.39% Dice score.

  29. Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    Med-DisSeg uses a dispersive loss on batch representations plus adaptive multi-scale decoding to achieve state-of-the-art fine-grained segmentation on five medical imaging datasets.

  30. RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense Prediction

    cs.CV 2026-05 unverdicted novelty 4.0

    RD-ViT matches or exceeds standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.

  31. Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection

    cs.CV 2026-05 unverdicted novelty 4.0

    A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher mod...

  32. an interpretable vision transformer framework for automated brain tumor classification

    cs.CV 2026-04 unverdicted novelty 4.0

    Vision Transformer with CLAHE preprocessing, two-stage fine-tuning, MixUp/CutMix, EMA, TTA, and attention rollout achieves 99.29% accuracy and 99.25% macro F1 on four-class brain tumor MRI classification from 7023 scans.

  33. MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer

    cs.CV 2026-04 unverdicted novelty 4.0

    MAE self-supervised pretraining of nnFormer yields higher Dice scores, faster convergence, and better generalization when labeled medical segmentation data is scarce.

  34. Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment

    cs.CV 2026-04 unverdicted novelty 4.0

    A CNN-attention model achieves 99.2% accuracy on seen MRI sites and 75.5% on unseen heterogeneous sites for motion artifact quality assessment.

  35. ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation

    cs.CV 2026-04 unverdicted novelty 4.0

    ASGNet combines a spectrum-guided non-local perception module, multi-source semantic extractor, and dense cross-layer decoder to outperform 21 prior methods on five polyp segmentation benchmarks.

  36. PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation

    cs.CV 2026-04 unverdicted novelty 4.0

    PBE-UNet adds scale-aware aggregation and progressive boundary expansion modules to U-Net and reports better segmentation performance than prior methods on four ultrasound datasets.

  37. TAMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation

    cs.CV 2026-04 unverdicted novelty 4.0

    TAMISeg uses text prompts and semantic distillation from a frozen DINOv3 teacher inside a consistency-aware encoder and scale-adaptive decoder to outperform prior uni- and multi-modal methods on polyp and COVID-19 seg...

  38. SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation

    cs.CV 2026-04 unverdicted novelty 4.0

    SwinTextUNet integrates CLIP text guidance into Swin U-Net via cross-attention and convolutional fusion, achieving 86.47% Dice and 78.2% IoU on QaTaCOV19 medical image segmentation.

  39. Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

    cs.CV 2026-04 unverdicted novelty 4.0

    STGR framework integrates LLaMA-3-V and MedSAM via text-to-vision distillation and graph reasoning, achieving 81.5% DSC on LIDC-IDRI with under 1% parameter updates and high cross-fold stability.

  40. Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery

    cs.CV 2026-04 unverdicted novelty 2.0

    DeepLabV3 matches SegFormer performance in multi-class surgical instrument segmentation while convolutional baselines like UNet remain competitive on the SAR-RARP50 dataset.
