TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
Pith reviewed 2026-05-11 13:30 UTC · model grok-4.3
The pith
TransUNet shows that a transformer encoder on CNN features can supply the global context U-Net needs for accurate medical image segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that transformers can serve as strong encoders for medical image segmentation. Tokenized image patches extracted from a CNN feature map are processed by a transformer to capture global context. A U-Net-style decoder then upsamples these features and fuses them with the original high-resolution CNN feature maps to recover precise localization. The paper reports better results than competing methods on multi-organ and cardiac segmentation benchmarks.
What carries the argument
A hybrid encoder-decoder in which a transformer processes tokenized patches from a CNN feature map to capture global context through self-attention. The decoder then merges that context with high-resolution CNN features via skip connections to recover localization.
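The data flow described here can be traced as a sequence of tensor shapes. The sketch below is a minimal illustration, not the authors' implementation. The 224×224 input, the 16× CNN downsampling, the 768-dimensional tokens, one token per feature-map position, and the skip scales (1/8, 1/4, 1/2) are all assumptions chosen for illustration, and `trace_shapes` is a hypothetical helper.

```python
def trace_shapes(height=224, width=224, cnn_stride=16, hidden_dim=768,
                 skip_strides=(8, 4, 2)):
    """List (stage, shape) pairs for a hybrid CNN-Transformer U-shaped model.

    Every default value is an illustrative assumption.
    """
    if height % cnn_stride or width % cnn_stride:
        raise ValueError("input size must be divisible by the CNN stride")
    grid_h, grid_w = height // cnn_stride, width // cnn_stride
    stages = [
        # CNN backbone output, spatial size only (channel count left open).
        ("cnn_feature_grid", (grid_h, grid_w)),
        # One token per grid position, projected to the hidden size.
        ("tokens", (grid_h * grid_w, hidden_dim)),
        # Self-attention mixes all tokens but keeps the sequence shape.
        ("transformer_out", (grid_h * grid_w, hidden_dim)),
        # Sequence folded back into a 2-D grid for the decoder.
        ("reshaped", (grid_h, grid_w, hidden_dim)),
    ]
    stride = cnn_stride
    for skip in skip_strides:
        stride //= 2  # each decoder block upsamples by 2x
        if stride != skip:
            raise ValueError("skip scale must match the upsampled scale")
        # Upsampled features are fused with the CNN skip at the same scale.
        stages.append((f"decoder_fused_1/{stride}",
                       (height // stride, width // stride)))
    # Final upsampling restores input resolution for per-pixel labels.
    stages.append(("output_mask", (height, width)))
    return stages


for name, shape in trace_shapes():
    print(f"{name:>18}: {shape}")
```

Under these assumed sizes the token sequence is short (a 14×14 grid), which is what keeps full self-attention affordable, while the skip connections reintroduce the resolution lost to the 16× downsampling.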
If this is right
- Medical segmentation pipelines can replace pure CNN encoders with transformer encoders while keeping the decoder and skip connections intact.
- Long-range dependencies become explicitly modeled without sacrificing the fine boundary detail that CNN features provide.
- Performance gains appear across both multi-organ and cardiac tasks when the same hybrid fusion is applied.
- The architecture offers a template for other encoder-decoder segmentation models that need both global and local cues.
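To make "explicitly modeled long-range dependencies" concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It assumes identity query, key, and value projections in place of the learned matrices a real Transformer layer uses, and the four toy tokens are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    `tokens` is a list of equal-length float vectors. Learned query/key/value
    matrices are omitted (an assumption made for brevity).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Score the query against every token, however far apart they are.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(a * tok[d] for a, tok in zip(attn, tokens))
                        for d in range(dim)])
    return outputs, weights


# Four toy tokens; the first and last are identical despite being far apart.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

The first token gives the distant last token the same weight it gives itself, because attention weights depend on content similarity rather than position. A convolution with a small kernel would need many stacked layers before those two positions could interact.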
Where Pith is reading between the lines
- Similar tokenization-plus-fusion steps could be tested on non-medical imaging domains that already use U-Net variants.
- The approach may reduce the amount of local annotation needed if global context helps the model generalize from fewer examples.
- Future work could measure whether the transformer encoder alone, without the CNN backbone, suffices on larger training sets.
Load-bearing premise
That adding global context from the transformer to the CNN skip connections will reliably raise localization accuracy on new clinical data without introducing fresh failure modes.
What would settle it
A head-to-head evaluation on a held-out multi-center medical dataset in which the hybrid model shows no gain or a drop in Dice or Hausdorff metrics relative to a standard U-Net baseline.
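The two metrics named in this test can be sketched for small binary masks as follows. This is an illustrative brute-force version, not an evaluation script from the paper. It assumes 2-D masks given as nested lists of 0/1, measures distance in pixel units over all foreground pixels, and uses the plain maximum for the Hausdorff distance. Benchmarks often report a percentile variant instead.

```python
import math


def dice(pred, target):
    """Dice coefficient between two binary masks given as nested 0/1 lists."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    overlap = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total


def foreground(mask):
    """Coordinates (row, col) of all foreground pixels."""
    return [(r, c) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v]


def hausdorff(pred, target):
    """Symmetric Hausdorff distance over foreground pixels, brute force."""
    a, b = foreground(pred), foreground(target)
    if not a or not b:
        raise ValueError("Hausdorff distance needs non-empty masks")

    def directed(src, dst):
        # Largest distance from a source pixel to its nearest target pixel.
        return max(min(math.dist(s, d) for d in dst) for s in src)

    return max(directed(a, b), directed(b, a))


pred = [[1, 1, 0],
        [0, 0, 0],
        [0, 0, 0]]
target = [[1, 0, 0],
          [0, 0, 0],
          [0, 0, 1]]
print(dice(pred, target), hausdorff(pred, target))
```

Dice rewards overlap and is dominated by the bulk of a structure, whereas the Hausdorff distance is driven by the single worst boundary error, so a model can improve on one while degrading on the other.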
Original abstract
Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TransUNet, a hybrid CNN-Transformer architecture for medical image segmentation. A CNN backbone extracts feature maps that are tokenized and processed by a Transformer encoder to capture global context; these are then decoded via a U-Net-style upsampling path that fuses the Transformer output with high-resolution CNN skip connections to recover fine localization details. The central claim is that this design yields superior empirical performance over competing methods on multi-organ and cardiac segmentation tasks.
Significance. If the performance gains are robust, the work is significant for demonstrating that Transformers can function as strong encoders in medical imaging by explicitly modeling long-range dependencies while the U-Net decoder mitigates localization weaknesses. The public release of code and models is a clear strength that supports reproducibility and extension by the community.
major comments (2)
- [§4] §4 (Experiments): The superiority claims rest on comparisons to U-Net and other baselines, but no ablation is reported against a pure CNN encoder (e.g., deeper/wider ResNet or U-Net) with matched parameter count and training protocol. Without this, it is impossible to determine whether gains derive from the Transformer’s global attention or simply from increased capacity, which is load-bearing for the central architectural claim.
- [§4.1] §4.1 and results tables: Performance improvements are presented without statistical significance tests, confidence intervals, or multiple random-seed runs. Given the small size and high variability of medical datasets, modest metric gains may not be reliable, directly affecting the strength of the “superior performance” assertion.
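The capacity-matching concern in the first major comment can be made concrete with a rough parameter count. The sketch below assumes a standard Transformer encoder layer (four D×D attention projections with biases, a two-layer MLP, two LayerNorms) and plain 3×3 convolutions with bias. The sizes 768, 3072, 12 layers, and 512 channels are illustrative assumptions, not figures taken from the manuscript.

```python
def transformer_layer_params(dim, mlp_dim):
    """Parameters in one standard Transformer encoder layer."""
    attention = 4 * (dim * dim + dim)      # Q, K, V, output projections + biases
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)
    layer_norms = 2 * (2 * dim)            # scale and shift for two LayerNorms
    return attention + mlp + layer_norms


def conv_layer_params(in_ch, out_ch, kernel=3):
    """Parameters in one 2-D convolution with bias."""
    return in_ch * out_ch * kernel * kernel + out_ch


def conv_layers_to_match(budget, channels, kernel=3):
    """Smallest number of same-width conv layers whose parameters reach `budget`."""
    per_layer = conv_layer_params(channels, channels, kernel)
    return -(-budget // per_layer)         # ceiling division


encoder = 12 * transformer_layer_params(dim=768, mlp_dim=3072)
print(f"Transformer stack: {encoder:,} parameters")
print(f"3x3 conv layers (512 ch) to match: {conv_layers_to_match(encoder, 512)}")
```

A count like this only fixes the parameter budget. A fair ablation would also need to hold the training protocol, pretraining data, and input resolution constant, as the comment notes.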
minor comments (2)
- [§3.2] §3.2: The precise reshaping of the Transformer output sequence back into a 2-D feature map before decoder fusion is described at a high level; adding an equation or explicit tensor-shape diagram would improve clarity and reproducibility.
- [Figure 2] Figure 2: The architecture diagram would benefit from explicit annotation of the tokenization step and the exact skip-connection fusion points to match the textual description.
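One way the reshaping step requested in the §3.2 comment could be written is sketched below. The notation is an assumption for illustration: an H×W input, patch size P, hidden size D, L Transformer layers, and row-major token order. It is not an equation quoted from the manuscript.

```latex
\[
\mathbf{z}_L \in \mathbb{R}^{N \times D},
\qquad
N = \frac{H}{P} \cdot \frac{W}{P},
\]
\[
\mathbf{F} \in \mathbb{R}^{\frac{H}{P} \times \frac{W}{P} \times D},
\qquad
\mathbf{F}[i, j, :] = \mathbf{z}_L\!\left[\, i \cdot \tfrac{W}{P} + j,\ : \,\right],
\qquad
0 \le i < \tfrac{H}{P},\quad 0 \le j < \tfrac{W}{P}.
\]
```

Stating the index map explicitly removes any ambiguity about whether tokens are ordered row-major or column-major, which matters when the reshaped grid is fused with CNN skip features at matching spatial positions.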
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of our work. We address the major comments point-by-point below.
-
Referee: [§4] §4 (Experiments): The superiority claims rest on comparisons to U-Net and other baselines, but no ablation is reported against a pure CNN encoder (e.g., deeper/wider ResNet or U-Net) with matched parameter count and training protocol. Without this, it is impossible to determine whether gains derive from the Transformer’s global attention or simply from increased capacity, which is load-bearing for the central architectural claim.
Authors: We appreciate the referee's point regarding the need for capacity-matched ablations to isolate the benefit of the Transformer encoder. Our manuscript compares TransUNet to the standard U-Net and other established baselines under consistent training protocols. The improvements are observed across different medical segmentation tasks, supporting that the global context modeling from the Transformer contributes to better performance. To directly address this, we will include parameter counts for all models in the revised manuscript and add an ablation study using a deeper CNN backbone with comparable parameters where possible.
Revision: partial
-
Referee: [§4.1] §4.1 and results tables: Performance improvements are presented without statistical significance tests, confidence intervals, or multiple random-seed runs. Given the small size and high variability of medical datasets, modest metric gains may not be reliable, directly affecting the strength of the “superior performance” assertion.
Authors: We concur that providing statistical analysis would enhance the reliability of our results, particularly given the characteristics of medical imaging datasets. We will update the experimental section to include multiple runs with different random seeds, report standard deviations or confidence intervals, and conduct appropriate statistical tests to validate the significance of the performance gains in the revised manuscript.
Revision: yes
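The analysis promised in this response could look roughly like the sketch below: summarize scores across seeds, then run an exact paired sign-flip permutation test. The score lists are synthetic placeholders, not results from the paper, and the helper names are hypothetical. The 95% interval uses a normal approximation, which is rough for five runs. A t-based interval would be wider.

```python
import itertools
import math
import statistics


def summarize(scores):
    """Mean, sample standard deviation, and a normal-approximation 95% CI."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    half = 1.96 * std / math.sqrt(len(scores))
    return mean, std, (mean - half, mean + half)


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided paired permutation test on the summed difference.

    Enumerates all 2^n sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / 2 ** len(diffs)


# Synthetic Dice scores for five paired seeds: placeholders, NOT paper results.
model_a = [0.80, 0.82, 0.79, 0.81, 0.83]
model_b = [0.78, 0.80, 0.80, 0.79, 0.80]

mean_a, std_a, ci_a = summarize(model_a)
p_value = paired_sign_flip_pvalue(model_a, model_b)
print(f"model A: {mean_a:.3f} +/- {std_a:.3f}, 95% CI {ci_a}")
print(f"paired sign-flip p-value: {p_value:.3f}")
```

With only five paired runs the smallest attainable two-sided p-value is 2/32, so a test like this cannot reach conventional significance thresholds without more seeds or per-case pairing.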
Circularity Check
No circularity: empirical architecture proposal with benchmark validation
The paper introduces TransUNet as a hybrid CNN-Transformer encoder-decoder for medical segmentation and supports its claims solely through empirical results on public benchmarks (multi-organ and cardiac segmentation). No equations, predictions, or first-principles derivations appear in the provided text; the architecture is defined explicitly and then evaluated, without any fitted parameter being relabeled as a prediction or any load-bearing premise reducing to a self-citation. The superiority statements are presented as experimental outcomes rather than closed-loop theoretical necessities, making the derivation chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/HierarchyEmergence.lean · hierarchy_emergence_forces_phi · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.
-
CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites
CoralLite dataset and V-Trans-UNet baseline enable segmentation of individual corallites from μCT scans of Porites coral colonies with reported Dice scores of 0.77 on same-colony slices and 0.63 on unrelated specimens.
-
Weakly Supervised Segmentation as Semantic-Based Regularization
Differentiable fuzzy logic constraints fine-tune SAM to generate higher-quality pseudo-labels, enabling a second-stage model to reach state-of-the-art weakly supervised segmentation on Pascal VOC and REFUGE2, sometime...
-
RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis
RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.
-
Camyla: Scaling Autonomous Research in Medical Image Segmentation
Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
-
Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening
NPS-Net formulates optic disc and cup segmentation as nested radially monotone polar occupancy estimation to guarantee star-convexity, nesting, and high accuracy for glaucoma screening.
-
XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines eve...
-
Fast Single Nitrogen-Vacancy Center Ramsey Characterization using a Physics-Informed Neural Network
NVRNet uses pretrained simulation-based U-Nets with attention and parameter-efficient adapters, followed by a transformer estimator, to reconstruct clean Ramsey waveforms and infer hyperfine parameters from minimal-sw...
-
Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
GDLA delivers state-of-the-art accuracy on CT, MRI, ultrasound and dermoscopy segmentation benchmarks while keeping linear O(N) complexity in a PVT encoder-decoder.
-
U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation
U-Mamba is a hybrid CNN-SSM architecture that outperforms prior CNN and Transformer networks on biomedical image segmentation tasks by efficiently modeling long-range dependencies.
-
SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction
SO-Mamba introduces state-ownership routing in Mamba regularizers for unrolled MRI reconstruction to separate resident carrier content from non-resident evidence across stages.
-
Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation
SGP-Net improves few-shot medical image segmentation by using spectral frequency decomposition for cue disentanglement and geodesic matching on feature manifolds instead of cosine similarity.
-
VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption
VoxShield disrupts inter-slice frequency consistency and semantic logits in 3D medical images to degrade segmentation model performance to near-random levels with epsilon=4/255 perturbations.
-
Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.
-
Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning
A hybrid CNN-transformer model with multi-task learning achieves 91.3% WBC classification accuracy and 0.72 Pearson correlation for CD16 expression regression from label-free DPC images, augmented by LLM-generated summaries.
-
Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI
A three-stage pipeline detects 16 landmarks, coarsely segments 12 labels, and refines them into 26 structures using landmark constraints to improve accuracy in subcortical MRI segmentation.
-
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.
-
ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation
ESICA delivers state-of-the-art accuracy on a five-modality 3D medical segmentation benchmark while offering a compact variant with far fewer parameters.
-
MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention
MedFlowSeg is a conditional flow matching model for medical image segmentation that adds dual-branch spatial attention and frequency-aware attention to achieve more efficient inference than diffusion models while impr...
-
Rethinking Uncertainty in Segmentation: From Estimation to Decision
Uncertainty optimization alone misses most safety gains; a decision-stage deferral policy removes up to 80% segmentation errors at 25% pixel deferral with cross-dataset robustness, while calibration does not improve d...
-
Human Gaze-based Dual Teacher Guidance Learning for Semi-Supervised Medical Image Segmentation
HG-DTGL integrates human gaze as an extra teacher in mean-teacher learning via GazeMix, MGP module and Gaze Loss, reporting superior segmentation across ten organs on multiple modalities.
-
A Fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation
A hybrid Transformer-UNet model with energy-shifting inputs generates 6 MV LINAC dose maps from monoenergetic data, achieving over 98% gamma passing rate (3%/3mm) versus full Monte Carlo for prostate radiotherapy.
-
CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing
The paper defines the Conformal Hallucination Estimation Metric (CHEM) that localizes hallucination-prone regions in image reconstruction models via multiscale representations and distribution-free conformal regression.
-
TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark ...
-
Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
A video-trained large vision model achieves competitive zero-shot performance on organ segmentation, denoising, super-resolution, and 4D CT motion prediction in medical imaging, outperforming some specialized baseline...
-
Primus: Enforcing Attention Usage for 3D Medical Image Segmentation
Primus and PrimusV2 are Transformer-centric models that match or exceed nnU-Net and top CNNs on nine 3D medical segmentation datasets by enforcing attention usage.
-
Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
Learn2Synth optimizes data synthesis parameters with hypergradients to train segmentation networks solely on synthetic brain images that generalize to real scans.
-
M²SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation
M²SNet uses intra- and inter-layer multi-scale subtraction units plus a training-free LossNet to generate difference features that reduce redundancy in decoder fusion for medical segmentation.
-
Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation
UGCP is a differentiable plug-in that performs uncertainty-guided local logit updates to refine vessel segmentations, reduce disconnections, and improve structural consistency on 2D and 3D medical datasets.
-
RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection
RadGenome-Anatomy is a large-scale chest radiograph dataset with anatomy labels obtained by projecting 3D CT masks into 2D radiographic space for 210 structures in 25,692 studies.
-
DepthPolyp: Pseudo-Depth Guided Lightweight Segmentation for Real-Time Colonoscopy
DepthPolyp is a compact model using pseudo-depth multi-task learning and efficient feature modules that delivers strong generalization and real-time performance for polyp segmentation in noisy colonoscopy data.
-
MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation
MHMamba combines a U-Net with multi-head Mamba, channel calibration, and adaptive skip fusion to improve 3D brain tumor segmentation accuracy and small-lesion sensitivity on BraTS datasets while retaining linear complexity.
-
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
-
A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images
A TransUNet-based segmentation followed by texture comparison classifies fatty pancreas in ultrasound with 89.7% accuracy on a small clinical dataset.
-
Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.
-
Lightning Unified Video Editing via In-Context Sparse Attention
ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
-
Domain-Adaptive Arrhythmia Classification Using a Hybrid Transformer on Wearable Heart Signals
A hybrid transformer combining ECG morphology and HRV features with MMD domain adaptation achieves 95% F1-macro on unseen wearable data for arrhythmia classification.
-
Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions
Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...
-
SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT
SAIL integrates anatomical priors at the representation level with semantic features via fusion to produce more anatomically aligned attribution maps in OCT without altering existing explainability techniques.
-
CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model
Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.
-
MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.
-
Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images
A semi-supervised framework using weighted knowledge distillation and SinusCycle-GAN refinement achieves 96.35% Dice score for maxillary sinus segmentation in panoramic X-rays from 2,511 patients.
-
EDU-Net: Retinal Pathological Fluid Segmentation in OCT Images with Multiscale Feature Fusion and Boundary Optimization
EDU-Net fuses multiscale local and global features with boundary optimization to achieve state-of-the-art segmentation of intraretinal and subretinal fluid in OCT images.
-
RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation
RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.
-
UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation
UniSemAlign aligns text and prototype representations with visual features to generate better supervision signals for semi-supervised segmentation, reporting Dice gains of up to 8.6% on CRAG with 10% labels.
-
Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
A ViT-LSTM spatiotemporal model detects surgical instrument handovers and classifies direction in videos, achieving F1 of 0.84 for detection and 0.72 mean F1 for direction on kidney transplant data.
-
CardioSAM: Topology-Aware Decoder Design for High-Precision Cardiac MRI Segmentation
CardioSAM introduces a topology-aware decoder and boundary refinement module to elevate SAM's performance on cardiac MRI, achieving a 93.39% Dice score.
-
A novel attention mechanism for noise-adaptive and robust segmentation of microtubules in microscopy images
ASE_Res_UNet with a novel noise-adaptive attention mechanism outperforms ablated variants and alternative architectures in segmenting microtubules from noisy synthetic and real microscopy images while using fewer para...
-
FedRef: Bayesian Fine-Tuning using a Reference Model to Mitigate Catastrophic Forgetting for Heterogeneous Federated Learning
FedRef uses a temporally aggregated reference model and MAP regularization for server-side fine-tuning to reduce forgetting and drift in non-IID federated learning, showing better accuracy and lower client compute on ...
-
MSLAU-Net: A Hybrid CNN-Transformer Network for Medical Image Segmentation
MSLAU-Net proposes a hybrid CNN-Transformer architecture using multi-scale linear attention and lightweight top-down aggregation that outperforms prior methods on medical segmentation benchmarks across three modalities.
-
Uncertainty-Based Ensemble Learning in CMR Semantic Segmentation
Uncertainty-weighted ensemble learning achieves near state-of-the-art overall segmentation accuracy while outperforming prior models on end slices in cardiac MRI.
-
ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation
ConvNeXt-FD pairs a ConvNeXt backbone with fractal-dimension boundary regularization inside a U-Net and reports competitive Dice and related scores on six biomedical segmentation benchmarks.
-
Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation
Med-DisSeg uses a dispersive loss on batch representations plus adaptive multi-scale decoding to achieve state-of-the-art fine-grained segmentation on five medical imaging datasets.
-
RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence Extending the Recurrent-Depth Transformer Architecture to Dense Prediction
RD-ViT matches or exceeds standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.
-
Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection
A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher mod...
-
an interpretable vision transformer framework for automated brain tumor classification
Vision Transformer with CLAHE preprocessing, two-stage fine-tuning, MixUp/CutMix, EMA, TTA, and attention rollout achieves 99.29% accuracy and 99.25% macro F1 on four-class brain tumor MRI classification from 7023 scans.
-
MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer
MAE self-supervised pretraining of nnFormer yields higher Dice scores, faster convergence, and better generalization when labeled medical segmentation data is scarce.
-
Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment
A CNN-attention model achieves 99.2% accuracy on seen MRI sites and 75.5% on unseen heterogeneous sites for motion artifact quality assessment.
-
ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation
ASGNet combines a spectrum-guided non-local perception module, multi-source semantic extractor, and dense cross-layer decoder to outperform 21 prior methods on five polyp segmentation benchmarks.
-
PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation
PBE-UNet adds scale-aware aggregation and progressive boundary expansion modules to U-Net and reports better segmentation performance than prior methods on four ultrasound datasets.