TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation
Recognition: 2 theorem links
Pith reviewed 2026-05-11 13:30 UTC · model grok-4.3
The pith
TransUNet shows that a transformer encoder on CNN features can supply the global context U-Net needs for accurate medical image segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that transformers can serve as strong encoders for medical image segmentation. Tokenized image patches extracted from a CNN feature map are processed by a transformer to capture global context; a U-Net-style decoder then upsamples these features and fuses them with the original high-resolution CNN feature maps to recover precise localization. This hybrid outperforms competing methods on multi-organ and cardiac segmentation benchmarks.
What carries the argument
A hybrid encoder-decoder in which a transformer processes tokenized patches from a CNN feature map to extract global self-attention context, which the decoder then merges via skip connections with high-resolution CNN features to recover localization.
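The shape bookkeeping behind this hybrid can be sketched end to end. The following numpy toy is illustrative only: the sizes, random weights, and single-head attention are stand-ins for the paper's actual ResNet-50 backbone and ViT encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes (illustrative, not the paper's configuration).
B, C, H, W = 1, 64, 14, 14   # CNN feature map at reduced resolution
D = 32                        # transformer embedding dimension

feat = rng.standard_normal((B, C, H, W))          # CNN encoder output

# "Patch embedding": each spatial position of the feature map becomes one token.
W_embed = rng.standard_normal((C, D)) / np.sqrt(C)
tokens = feat.reshape(B, C, H * W).transpose(0, 2, 1) @ W_embed  # (B, HW, D)

def self_attention(x):
    """Single-head self-attention: every token attends to every other token,
    which is the global-context step the hybrid encoder relies on."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

encoded = self_attention(tokens)                  # (B, HW, D)

# Decoder side: fold tokens back into a 2-D map, upsample, and fuse with a
# high-resolution CNN skip feature by channel concatenation.
fmap = encoded.transpose(0, 2, 1).reshape(B, D, H, W)
up = fmap.repeat(2, axis=2).repeat(2, axis=3)     # nearest-neighbour x2
skip = rng.standard_normal((B, 16, 2 * H, 2 * W)) # hypothetical skip feature
fused = np.concatenate([up, skip], axis=1)        # (B, D + 16, 28, 28)
```

The fused tensor is what a decoder convolution would consume: transformer-derived global context and CNN-derived local detail side by side in the channel dimension.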
If this is right
- Medical segmentation pipelines can replace pure CNN encoders with transformer encoders while keeping the decoder and skip connections intact.
- Long-range dependencies become explicitly modeled without sacrificing the fine boundary detail that CNN features provide.
- Performance gains appear across both multi-organ and cardiac tasks when the same hybrid fusion is applied.
- The architecture offers a template for other encoder-decoder segmentation models that need both global and local cues.
Where Pith is reading between the lines
- Similar tokenization-plus-fusion steps could be tested on non-medical imaging domains that already use U-Net variants.
- The approach may reduce the amount of local annotation needed if global context helps the model generalize from fewer examples.
- Future work could measure whether the transformer encoder alone, without the CNN backbone, suffices on larger training sets.
Load-bearing premise
That adding global context from the transformer to the CNN skip connections will reliably raise localization accuracy on new clinical data without introducing fresh failure modes.
What would settle it
A head-to-head evaluation on a held-out multi-center medical dataset in which the hybrid model shows no gain or a drop in Dice or Hausdorff metrics relative to a standard U-Net baseline.
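The Dice metric named above is straightforward to compute from binary masks; a minimal sketch (the toy masks are illustrative):

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient for binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

# Toy 4x4 masks: 3 predicted pixels, 3 true pixels, 2 overlapping.
pred = np.zeros((4, 4), dtype=bool); pred[0, :3] = True
gt = np.zeros((4, 4), dtype=bool); gt[0, 1:4] = True
score = dice(pred, gt)  # 2*2 / (3+3) = 0.666...
```

The Hausdorff distance, the other metric mentioned, additionally penalizes boundary outliers that Dice averages away, which is why the two are usually reported together.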
Original abstract
Medical image segmentation is an essential prerequisite for developing healthcare systems, especially for disease diagnosis and treatment planning. On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de-facto standard and achieved tremendous success. However, due to the intrinsic locality of convolution operations, U-Net generally demonstrates limitations in explicitly modeling long-range dependency. Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures with innate global self-attention mechanisms, but can result in limited localization abilities due to insufficient low-level details. In this paper, we propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation. On one hand, the Transformer encodes tokenized image patches from a convolution neural network (CNN) feature map as the input sequence for extracting global contexts. On the other hand, the decoder upsamples the encoded features which are then combined with the high-resolution CNN feature maps to enable precise localization. We argue that Transformers can serve as strong encoders for medical image segmentation tasks, with the combination of U-Net to enhance finer details by recovering localized spatial information. TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation. Code and models are available at https://github.com/Beckschen/TransUNet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TransUNet, a hybrid CNN-Transformer architecture for medical image segmentation. A CNN backbone extracts feature maps that are tokenized and processed by a Transformer encoder to capture global context; these are then decoded via a U-Net-style upsampling path that fuses the Transformer output with high-resolution CNN skip connections to recover fine localization details. The central claim is that this design yields superior empirical performance over competing methods on multi-organ and cardiac segmentation tasks.
Significance. If the performance gains are robust, the work is significant for demonstrating that Transformers can function as strong encoders in medical imaging by explicitly modeling long-range dependencies while the U-Net decoder mitigates localization weaknesses. The public release of code and models is a clear strength that supports reproducibility and extension by the community.
Major comments (2)
- §4 (Experiments): The superiority claims rest on comparisons to U-Net and other baselines, but no ablation is reported against a pure CNN encoder (e.g., a deeper/wider ResNet or U-Net) with matched parameter count and training protocol. Without this, it is impossible to determine whether gains derive from the Transformer's global attention or simply from increased capacity, which is load-bearing for the central architectural claim.
- §4.1 and results tables: Performance improvements are presented without statistical significance tests, confidence intervals, or multiple random-seed runs. Given the small size and high variability of medical datasets, modest metric gains may not be reliable, directly affecting the strength of the "superior performance" assertion.
Minor comments (2)
- §3.2: The precise reshaping of the Transformer output sequence back into a 2-D feature map before decoder fusion is described only at a high level; adding an equation or explicit tensor-shape diagram would improve clarity and reproducibility.
- Figure 2: The architecture diagram would benefit from explicit annotation of the tokenization step and the exact skip-connection fusion points to match the textual description.
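The reshaping the first minor comment asks to have spelled out can be written in one line. A minimal sketch, assuming a standard ViT output of 196 tokens (a 14x14 patch grid in row-major order) with dimension 768; the numbers are assumptions, not values taken from the paper:

```python
import numpy as np

B, n_patch, D = 2, 196, 768          # hypothetical ViT output shape
H = W = int(n_patch ** 0.5)          # 14 x 14 patch grid

z = np.random.randn(B, n_patch, D)   # transformer output sequence
# (B, HW, D) -> (B, D, HW) -> (B, D, H, W): token i lands at grid cell
# (i // W, i % W), i.e. the row-major order of the original patch grid.
fmap = z.transpose(0, 2, 1).reshape(B, D, H, W)
```

After this fold, `fmap` has the layout a convolutional decoder expects, so upsampling and skip-connection fusion proceed as in a standard U-Net.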
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of our work. We address the major comments point-by-point below.
Point-by-point responses
- Referee: §4 (Experiments): The superiority claims rest on comparisons to U-Net and other baselines, but no ablation is reported against a pure CNN encoder (e.g., a deeper/wider ResNet or U-Net) with matched parameter count and training protocol. Without this, it is impossible to determine whether gains derive from the Transformer's global attention or simply from increased capacity, which is load-bearing for the central architectural claim.
Authors: We appreciate the referee's point regarding the need for capacity-matched ablations to isolate the benefit of the Transformer encoder. Our manuscript compares TransUNet to the standard U-Net and other established baselines under consistent training protocols. The improvements are observed across different medical segmentation tasks, supporting that the global context modeling from the Transformer contributes to better performance. To directly address this, we will include parameter counts for all models in the revised manuscript and add an ablation study using a deeper CNN backbone with comparable parameters where possible. revision: partial
- Referee: §4.1 and results tables: Performance improvements are presented without statistical significance tests, confidence intervals, or multiple random-seed runs. Given the small size and high variability of medical datasets, modest metric gains may not be reliable, directly affecting the strength of the "superior performance" assertion.
Authors: We concur that providing statistical analysis would enhance the reliability of our results, particularly given the characteristics of medical imaging datasets. We will update the experimental section to include multiple runs with different random seeds, report standard deviations or confidence intervals, and conduct appropriate statistical tests to validate the significance of the performance gains in the revised manuscript. revision: yes
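The paired analysis the authors commit to can take several forms; one common choice for small medical test sets is a paired bootstrap confidence interval on per-case Dice differences. A minimal sketch with synthetic scores (the numbers below are illustrative, not results from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-case Dice scores for two models on the same 20 test
# cases (a paired design); purely illustrative data.
dice_baseline = rng.normal(0.82, 0.05, size=20)
dice_hybrid = dice_baseline + rng.normal(0.03, 0.02, size=20)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference b - a."""
    g = np.random.default_rng(seed)
    diff = b - a
    # Resample cases with replacement and recompute the mean difference.
    idx = g.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

lo, hi = paired_bootstrap_ci(dice_baseline, dice_hybrid)
significant = lo > 0.0   # CI excluding zero suggests the gain is not noise
```

Pairing by test case matters here: inter-patient variability in medical datasets is typically much larger than inter-model differences, and an unpaired test would absorb that variance into the comparison.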
Circularity Check
No circularity: empirical architecture proposal with benchmark validation
Full rationale
The paper introduces TransUNet as a hybrid CNN-Transformer encoder-decoder for medical segmentation and supports its claims solely through empirical results on public benchmarks (multi-organ and cardiac segmentation). No equations, predictions, or first-principles derivations appear in the provided text; the architecture is defined explicitly and then evaluated, without any fitted parameter being relabeled as a prediction or any load-bearing premise reducing to a self-citation. The superiority statements are presented as experimental outcomes rather than closed-loop theoretical necessities, making the derivation chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/HierarchyEmergence.lean, theorem hierarchy_emergence_forces_phi (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "TransUNet achieves superior performances to various competing methods on different medical applications including multi-organ segmentation and cardiac segmentation."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 42 Pith papers
- Weakly Supervised Segmentation as Semantic-Based Regularization
  Differentiable fuzzy logic constraints fine-tune SAM to generate higher-quality pseudo-labels, enabling a second-stage model to reach state-of-the-art weakly supervised segmentation on Pascal VOC and REFUGE2, sometime...
- RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis
  RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.
- Camyla: Scaling Autonomous Research in Medical Image Segmentation
  Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
- Nested Radially Monotone Polar Occupancy Estimation: Clinically-Grounded Optic Disc and Cup Segmentation for Glaucoma Screening
  NPS-Net formulates optic disc and cup segmentation as nested radially monotone polar occupancy estimation to guarantee star-convexity, nesting, and high accuracy for glaucoma screening.
- XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation
  XAttnRes introduces cross-stage attention residuals that maintain a global feature history and selectively aggregate prior representations, improving medical image segmentation and performing on par with baselines eve...
- Fast Single Nitrogen-Vacancy Center Ramsey Characterization using a Physics-Informed Neural Network
  NVRNet uses pretrained simulation-based U-Nets with attention and parameter-efficient adapters, followed by a transformer estimator, to reconstruct clean Ramsey waveforms and infer hyperfine parameters from minimal-sw...
- Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
  GDLA delivers state-of-the-art accuracy on CT, MRI, ultrasound and dermoscopy segmentation benchmarks while keeping linear O(N) complexity in a PVT encoder-decoder.
- Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning
  A hybrid CNN-transformer model with multi-task learning achieves 91.3% WBC classification accuracy and 0.72 Pearson correlation for CD16 expression regression from label-free DPC images, augmented by LLM-generated summaries.
- Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI
  A three-stage pipeline detects 16 landmarks, coarsely segments 12 labels, and refines them into 26 structures using landmark constraints to improve accuracy in subcortical MRI segmentation.
- DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation
  DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.
- ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation
  ESICA delivers state-of-the-art accuracy on a five-modality 3D medical segmentation benchmark while offering a compact variant with far fewer parameters.
- MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention
  MedFlowSeg is a conditional flow matching model for medical image segmentation that adds dual-branch spatial attention and frequency-aware attention to achieve more efficient inference than diffusion models while impr...
- Rethinking Uncertainty in Segmentation: From Estimation to Decision
  Uncertainty optimization alone misses most safety gains; a decision-stage deferral policy removes up to 80% segmentation errors at 25% pixel deferral with cross-dataset robustness, while calibration does not improve d...
- Human Gaze-based Dual Teacher Guidance Learning for Semi-Supervised Medical Image Segmentation
  HG-DTGL integrates human gaze as an extra teacher in mean-teacher learning via GazeMix, MGP module and Gaze Loss, reporting superior segmentation across ten organs on multiple modalities.
- A Fast and Generic Energy-Shifting Transformer for Hybrid Monte Carlo Radiotherapy Calculation
  A hybrid Transformer-UNet model with energy-shifting inputs generates 6 MV LINAC dose maps from monoenergetic data, achieving over 98% gamma passing rate (3%/3mm) versus full Monte Carlo for prostate radiotherapy.
- LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
  LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
- A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images
  A TransUNet-based segmentation followed by texture comparison classifies fatty pancreas in ultrasound with 89.7% accuracy on a small clinical dataset.
- Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction
  A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in histology.
- Lightning Unified Video Editing via In-Context Sparse Attention
  ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
- Domain-Adaptive Arrhythmia Classification Using a Hybrid Transformer on Wearable Heart Signals
  A hybrid transformer combining ECG morphology and HRV features with MMD domain adaptation achieves 95% F1-macro on unseen wearable data for arrhythmia classification.
- Memory-Efficient EDA Denoising via Knowledge Distillation for Wearable IoT Under Severe Motion Artifacts and Underwater Conditions
  Knowledge distillation from a hybrid CNN-Transformer teacher to a depth-wise separable CNN student, combined with realistic motion and environmental augmentation, produces a 15x smaller EDA denoiser that cuts underwat...
- SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT
  SAIL integrates anatomical priors at the representation level with semantic features via fusion to produce more anatomically aligned attribution maps in OCT without altering existing explainability techniques.
- CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model
  Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.
- MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation
  MambaLiteUNet integrates Mamba into U-Net with adaptive fusion, local-global mixing, and cross-gated attention modules to reach 87.12% IoU and 93.09% Dice on skin lesion datasets while cutting parameters by 93.6%.
- Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images
  A semi-supervised framework using weighted knowledge distillation and SinusCycle-GAN refinement achieves 96.35% Dice score for maxillary sinus segmentation in panoramic X-rays from 2,511 patients.
- EDU-Net: Retinal Pathological Fluid Segmentation in OCT Images with Multiscale Feature Fusion and Boundary Optimization
  EDU-Net fuses multiscale local and global features with boundary optimization to achieve state-of-the-art segmentation of intraretinal and subretinal fluid in OCT images.
- RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation
  RF-HiT uses rectified flow and a multi-scale hierarchical transformer to reach 91.27% Dice on ACDC and 87.40% on BraTS 2021 with only 10.14 GFLOPs, 13.6M parameters, and three inference steps.
- UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation
  UniSemAlign aligns text and prototype representations with visual features to generate better supervision signals for semi-supervised segmentation, reporting Dice gains of up to 8.6% on CRAG with 10% labels.
- Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
  A ViT-LSTM spatiotemporal model detects surgical instrument handovers and classifies direction in videos, achieving F1 of 0.84 for detection and 0.72 mean F1 for direction on kidney transplant data.
- CardioSAM: Topology-Aware Decoder Design for High-Precision Cardiac MRI Segmentation
  CardioSAM introduces a topology-aware decoder and boundary refinement module to elevate SAM's performance on cardiac MRI, achieving a 93.39% Dice score.
- Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation
  Med-DisSeg uses a dispersive loss on batch representations plus adaptive multi-scale decoding to achieve state-of-the-art fine-grained segmentation on five medical imaging datasets.
- RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence
  RD-ViT extends the recurrent-depth transformer architecture to dense prediction, matching or exceeding standard ViT segmentation accuracy on cardiac MRI using a shared recurrent block, fewer parameters, and less training data.
- Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection
  A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher mod...
- An Interpretable Vision Transformer Framework for Automated Brain Tumor Classification
  Vision Transformer with CLAHE preprocessing, two-stage fine-tuning, MixUp/CutMix, EMA, TTA, and attention rollout achieves 99.29% accuracy and 99.25% macro F1 on four-class brain tumor MRI classification from 7023 scans.
- MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer
  MAE self-supervised pretraining of nnFormer yields higher Dice scores, faster convergence, and better generalization when labeled medical segmentation data is scarce.
- Attention-Gated Convolutional Networks for Scanner-Agnostic Quality Assessment
  A CNN-attention model achieves 99.2% accuracy on seen MRI sites and 75.5% on unseen heterogeneous sites for motion artifact quality assessment.
- ASGNet: Adaptive Spectrum Guidance Network for Automatic Polyp Segmentation
  ASGNet combines a spectrum-guided non-local perception module, multi-source semantic extractor, and dense cross-layer decoder to outperform 21 prior methods on five polyp segmentation benchmarks.
- PBE-UNet: A Lightweight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation
  PBE-UNet adds scale-aware aggregation and progressive boundary expansion modules to U-Net and reports better segmentation performance than prior methods on four ultrasound datasets.
- TAMISeg: Text-Aligned Multi-scale Medical Image Segmentation with Semantic Encoder Distillation
  TAMISeg uses text prompts and semantic distillation from a frozen DINOv3 teacher inside a consistency-aware encoder and scale-adaptive decoder to outperform prior uni- and multi-modal methods on polyp and COVID-19 seg...
- SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation
  SwinTextUNet integrates CLIP text guidance into Swin U-Net via cross-attention and convolutional fusion, achieving 86.47% Dice and 78.2% IoU on QaTaCOV19 medical image segmentation.
- Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening
  STGR framework integrates LLaMA-3-V and MedSAM via text-to-vision distillation and graph reasoning, achieving 81.5% DSC on LIDC-IDRI with under 1% parameter updates and high cross-fold stability.
- Benchmarking CNN- and Transformer-Based Models for Surgical Instrument Segmentation in Robotic-Assisted Surgery
  DeepLabV3 matches SegFormer performance in multi-class surgical instrument segmentation while convolutional baselines like UNet remain competitive on the SAR-RARP50 dataset.
Reference graph
Works this paper leans on
- [1] Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)
- [2] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
- [3] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- [4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
- [5] Fu, S., Lu, Y., Wang, Y., Zhou, Y., Shen, W., Fishman, E., Yuille, A.: Domain adaptive relational reasoning for 3d multi-organ segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 656–666. Springer (2020)
- [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- [7] Li, X., Chen, H., Qi, X., Dou, Q., Fu, C.W., Heng, P.A.: H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Transactions on Medical Imaging 37(12), 2663–2674 (2018)
- [8] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3431–3440 (2015)
- [9] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV (2016)
- [10] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al.: Attention U-Net: Learning where to look for the pancreas. MIDL (2018)
- [11] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., Tran, D.: Image transformer. In: International Conference on Machine Learning. pp. 4055–
- [12] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
- [13] Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D.: Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis 53, 197–207 (2019)
- [14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
- [15] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803 (2018)
- [16] Yu, L., Cheng, J.Z., Dou, Q., Yang, X., Chen, H., Qin, J., Heng, P.A.: Automatic 3d cardiovascular MR segmentation with densely-connected volumetric convnets. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 287–295. Springer (2017)
- [17] Yu, Q., Xie, L., Wang, Y., Zhou, Y., Fishman, E.K., Yuille, A.L.: Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8280–8289 (2018)
- [18] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840 (2020)
- [19] Zhou, Y., Xie, L., Shen, W., Wang, Y., Fishman, E.K., Yuille, A.L.: A fixed-point model for pancreas segmentation in abdominal CT scans. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 693–701. Springer (2017)
- [20] Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: UNet++: A nested U-Net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11. Springer (2018)