pith. sign in

arxiv: 2606.29463 · v1 · pith:2OPSMCDWnew · submitted 2026-06-28 · 💻 cs.CV

CellDETR: A Detection-Guided Framework for Scalable Cell Representation Learning from Histopathology Images

Pith reviewed 2026-06-30 07:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords cell representation learninghistopathologywhole-slide imagesobject detectioncontrastive learningcell classificationPanNukespatial transcriptomics
0
0 comments X

The pith

CellDETR adds location feature decoupling and box-constrained attention to Deformable DETR to extract cell embeddings from whole-slide images that outperform prior methods on PanNuke and transfer across datasets after pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CellDETR is a detection-guided framework built on Deformable DETR for learning cell-level representations from histopathology whole-slide images. It introduces location feature decoupling and a box-constrained attention mechanism to enable automated extraction of cell embeddings. These modifications produce better supervised cell classification results on the PanNuke dataset than existing approaches. The same model supports contrastive pretraining on unlabeled slides and pretraining with Xenium spatial transcriptomics annotations, both of which improve downstream classification and allow accurate cell classification on new datasets.

Core claim

CellDETR, constructed by incorporating location feature decoupling and box-constrained attention into Deformable DETR, automates the extraction of cell-level embeddings from WSIs. This yields superior supervised cell classification on PanNuke compared to state-of-the-art approaches. Incorporating contrastive learning enables pretraining on unlabeled WSIs that enhances downstream classification, while pretraining with Xenium-derived annotations achieves accurate cross-dataset cell classification, indicating the embeddings' transferability and biological relevance.

What carries the argument

location feature decoupling and box-constrained attention mechanism, which separate positional information and restrict attention to detected bounding boxes to produce cell-specific embeddings.

If this is right

  • Outperforms existing state-of-the-art methods in supervised cell classification on PanNuke data.
  • A CellDETR-based contrastive pretraining model on unlabeled WSIs improves downstream cell classification performance.
  • Pretraining with Xenium spatial transcriptomics-derived cell annotations enables accurate cross-dataset cell classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The learned embeddings could support more detailed mapping of cell types and interactions within tissue microenvironments.
  • The detection-guided design may extend to representation learning for other small-scale objects in imaging domains beyond pathology.
  • Cross-dataset transfer success suggests the embeddings capture features general enough for use in settings with limited labeled data.

Load-bearing premise

The performance gains in cell embeddings and transfer come primarily from the location feature decoupling and box-constrained attention rather than from training choices, hyperparameters, or the base Deformable DETR architecture.

What would settle it

Training a standard Deformable DETR on PanNuke data without location feature decoupling or box-constrained attention and obtaining cell classification accuracy equal to or higher than CellDETR.

Figures

Figures reproduced from arXiv: 2606.29463 by Chulin Sha, Guojun Li, Shikang Zhang, Yicong Mao.

Figure 1
Figure 1. Figure 1: Overall Design of Cell DETR framework. We propose CellDETR, a deformable DETR￾based detection-guided framework that extracts transferable cell-level embeddings from pathology images through feature decouple and box-constrained attention design, enabling supervised, self￾supervised, and spatial-transcriptomics-informed cell representation learning from WSIs. 2.2 DETR and Object Detection Models DETR [18] in… view at source ↗
Figure 2
Figure 2. Figure 2: The architecture of CellDETR. (a) The overall model architecture, where the backbone, encoder, and Detection Decoder all follow the standard Deformable DETR structure. (b) Detailed structure of location feature decouple representation decoder, which uses a linear projection and an initialization layer to separate positional information from feature embeddings. (c) Illustration of the box-constrained attent… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of improved cell-specific feature attention of location decouple module and box-constraint attention module. We present the attention regions of the last two decoder layers of deformable attention. The attention map shows both location decouple module and box-constraint attention module are effective in cell-specific feature extraction. 7 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Contrastive learning implementation and results. (a) Illustration of cell-level contrastive learning design for CellDETR. The training section includes the loss descent curve and a similarity heatmap between the anchor and ground truth in the contrastive learning representation space. Both training and validation losses decrease normally, and the high values along the diagonal in the similarity heatmap ind… view at source ↗
Figure 5
Figure 5. Figure 5: Semi-supervised Dataset Construction and Application.In the data construction phase, we obtain single-cell gene expression profiles from Xenium spatial transcriptomics and derive cell labels using the annotation tool (Tacco[25]). Nuclear boundaries are segmented from pathological images via CellViT, followed by positional alignment to map cell type information onto the segmen￾tation masks. The annotated ce… view at source ↗
read the original abstract

Recent advances in pathology foundation models have substantially improved patch and slide level representation learning from whole-slide images (WSIs).However, cell-level representations learning remain underexplored, limiting cell resolved interpretability, biological discovery, and clinical translation. We propose CellDETR, a detection-guided framework built on Deformable DETR for scalable cell representation learning from WSIs. By introducing location feature decoupling and box-constrained attention mechanism, CellDETR enables automated extraction of cell-level embeddings, and outperform existing state-of-the-art methods in supervised cell classification on PanNuke data. In addition, by incorporating contrastive learning design, we build a CellDETR-based pretraining model for scalable cell representation learning from unlabeled WSIs, which improves downstream cell classification performance. Furthermore, we show that after pretraining with Xenium spatial transcriptomics-derived cell annotations, CellDETR achieves accurate cross-dataset cell classification, demonstrating the transferability and biological relevance of the learned cell embeddings. Together, CellDETR provides a scalable route toward general cell-level representation learning framework for interpretable computational patholog

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CellDETR, a detection-guided framework extending Deformable DETR with two modifications—location feature decoupling and box-constrained attention—for automated extraction of cell-level embeddings from whole-slide histopathology images. It claims these changes enable outperformance of existing methods on supervised cell classification using the PanNuke dataset, that adding a contrastive learning component yields a scalable pretraining approach on unlabeled WSIs that boosts downstream classification, and that pretraining with Xenium spatial transcriptomics annotations produces accurate cross-dataset cell classification, demonstrating transferability.

Significance. If the empirical claims hold after proper controls, the work could advance cell-resolved representation learning in computational pathology beyond current patch- or slide-level foundation models, supporting more interpretable biological discovery. No machine-checked proofs, reproducible code releases, or parameter-free derivations are described.

major comments (2)
  1. [Abstract] Abstract: the central attribution of performance gains and transferability to 'location feature decoupling and box-constrained attention' is unsupported because no ablation studies, quantitative metrics, baseline comparisons to unmodified Deformable DETR, or statistical tests are referenced; without these, the claim that the modifications are the primary drivers cannot be evaluated.
  2. [Abstract] Abstract: the statements of outperformance on PanNuke, improvement from contrastive pretraining, and cross-dataset accuracy with Xenium annotations are presented without any reported numbers, error bars, dataset splits, or implementation details, rendering the soundness of the empirical claims impossible to assess from the provided text.
minor comments (1)
  1. [Abstract] Abstract contains grammatical issues ('remain underexplored', 'outperform existing', 'provides a scalable route toward general ... framework') that should be corrected for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and agree that revisions to the abstract are warranted to better support the claims with references to the experimental evidence presented in the main text.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central attribution of performance gains and transferability to 'location feature decoupling and box-constrained attention' is unsupported because no ablation studies, quantitative metrics, baseline comparisons to unmodified Deformable DETR, or statistical tests are referenced; without these, the claim that the modifications are the primary drivers cannot be evaluated.

    Authors: The main manuscript includes ablation studies (Section 4.2), direct comparisons against the unmodified Deformable DETR baseline, quantitative metrics, and statistical tests demonstrating the contribution of the proposed modifications. We will revise the abstract to explicitly reference these ablations and key supporting results. revision: yes

  2. Referee: [Abstract] Abstract: the statements of outperformance on PanNuke, improvement from contrastive pretraining, and cross-dataset accuracy with Xenium annotations are presented without any reported numbers, error bars, dataset splits, or implementation details, rendering the soundness of the empirical claims impossible to assess from the provided text.

    Authors: We agree that the abstract would benefit from including representative quantitative results. The full manuscript reports these details with numbers, error bars, dataset splits, and implementation information in Sections 4.1, 4.3, and 4.4. We will update the abstract to incorporate key performance metrics from the PanNuke and Xenium experiments. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal with reported experimental outcomes

full rationale

The paper introduces CellDETR as a detection-guided framework extending Deformable DETR, with added location feature decoupling and box-constrained attention. All claims concern empirical performance on PanNuke classification, contrastive pretraining on WSIs, and cross-dataset transfer after Xenium pretraining. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central results are framed as experimental measurements rather than quantities defined in terms of themselves, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly relies on standard deep-learning assumptions about transformer attention and contrastive objectives, but these are not enumerated as ad-hoc to the paper.

pith-pipeline@v0.9.1-grok · 5733 in / 1256 out tokens · 39695 ms · 2026-06-30T07:22:36.554729+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021. URL...

  2. [2]

    Chen, Tong Ding, Ming Y

    Richard J. Chen, Tong Ding, Ming Y . Lu, Drew F. K. Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H. Song, Muhammad Shaban, et al. To- wards a general-purpose foundation model for computational pathology.Nature Medicine,

  3. [3]

    URL https://www.nature.com/articles/ s41591-024-02857-3

    doi: 10.1038/s41591-024-02857-3. URL https://www.nature.com/articles/ s41591-024-02857-3

  4. [4]

    Kunz, Matthew C

    Eugene V orontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, Ellen Yang, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan H. Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Mil...

  5. [5]

    Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Sheng Wang, and Hoifung Poon

    Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, Yanbo Xu, Mu Wei, Wenhui Wang, Shuming Ma, Furu Wei, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Jaylen Rosemon, Tucker Bower, Soohee Lee, Roshanthi Weerasinghe, Bill J. Wright, Ari Robicsek, Brian Piening, Carlo Bifulco, Shen...

  6. [6]

    Peter Neidlinger, Omar S. M. El Nahhas, Hannah Sophie Muti, Tim Lenz, et al. Benchmarking foundation models as feature extractors for weakly-supervised computational pathology.Nature Biomedical Engineering, 2025. doi: 10.1038/s41551-025-01516-3. arXiv:2408.15823

  7. [7]

    A clinical benchmark of public self-supervised pathology foundation models.Nature Communications, 16(1):3640,

    Gabriele Campanella, Shengjia Chen, Manan Singh, Ruchika Verma, et al. A clinical benchmark of public self-supervised pathology foundation models.Nature Communications, 16(1):3640,

  8. [8]

    doi: 10.1038/s41467-025-58796-1

  9. [9]

    Ding et al

    K. Ding et al. A large-scale synthetic pathological dataset for deep learning-enabled segmenta- tion of breast cancer.Scientific Data, 10(1):231, 2023. doi: 10.1038/s41597-023-02125-y

  10. [10]

    CellViT: Vision transformers for precise cell segmentation and classification.Medical Image Analysis, 94: 103143, 2024

    Fabian Hörst, Moritz Rempe, Lukas Heine, Constantin Seibold, Julius Keyl, Giulia Baldini, Selma Ugurel, Jens Siveke, Barbara Grünwald, Jan Egger, and Jens Kleesiek. CellViT: Vision transformers for precise cell segmentation and classification.Medical Image Analysis, 94: 103143, 2024. doi: 10.1016/j.media.2024.103143. URL https://www.sciencedirect. com/sci...

  11. [11]

    CellViT++: Energy-efficient and adaptive cell segmentation and classification using foundation models, 2025

    Fabian Hörst, Moritz Rempe, Helmut Becker, Lukas Heine, Julius Keyl, and Jens Kleesiek. CellViT++: Energy-efficient and adaptive cell segmentation and classification using foundation models, 2025. URLhttps://arxiv.org/abs/2501.05269

  12. [12]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008,

  13. [13]

    URL https://proceedings.neurips.cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  14. [14]

    Deformable DETR: Deformable transformers for end-to-end object detection

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. InInternational Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id= gZ9hCDWe6ke. 10

  15. [15]

    Pannuke dataset extension, insights and baselines.arXiv preprint arXiv:2003.10778, 2020

    Jevgenij Gamper, Navid Alemi Koohbanani, Simon Graham, Mostafa Jahanifar, Syed Ali Khurram, Ayesha Azam, Katherine Hewitt, and Nasir Rajpoot. Pannuke dataset extension, insights and baselines.arXiv preprint arXiv:2003.10778, 2020

  16. [16]

    Marco Salas et al

    S. Marco Salas et al. Optimizing xenium in situ data utility by quality assess- ment and best-practice analysis workflows.Nature Methods, 22:813–823, 2025. doi: 10.1038/s41592-025-02617-2. URL https://www.nature.com/articles/ s41592-025-02617-2

  17. [17]

    Ståhl, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, Michaela Asp, Jakub O

    Patrik L. Ståhl, Fredrik Salmén, Sanja Vickovic, Anna Lundmark, José Fernández Navarro, Jens Magnusson, Stefania Giacomello, Michaela Asp, Jakub O. Westholm, Mikael Huss, Annelie Mollbrink, Sten Linnarsson, Simone Codeluppi, Åsa Borg, Fredrik Pontén, Paul Igor Costea, Pelin Sahlén, Jan Mulder, Jonas Nilsson, and Joakim Lundeberg. Visualization and analysi...

  18. [18]

    Mask R-CNN

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017

  19. [19]

    Micro-net: A unified model for segmentation of various objects in microscopy images.Medical Image Analysis, 52:160–173, 2019

    Shan E Ahmed Raza, Linda Cheung, Muhammad Shaban, Simon Graham, David Epstein, Stella Pelengaris, Michael Khan, and Nasir M Rajpoot. Micro-net: A unified model for segmentation of various objects in microscopy images.Medical Image Analysis, 52:160–173, 2019. doi: 10.1016/j.media.2018.12.003

  20. [20]

    Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images.Medical Image Analysis, 58:101563, 2019

    Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, and Nasir Rajpoot. Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images.Medical Image Analysis, 58:101563, 2019. doi: 10.1016/j.media. 2019.101563

  21. [21]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020. doi: 10.1007/978-3-030-58452-8_13. URLhttps://arxiv.org/abs/2005.12872

  22. [22]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021

  23. [23]

    Nuclear morphology is a deep learning biomarker of cellular senescence and predicts clinical outcomes in cancer.Nature Aging, 2022

    Ines Heckenbach et al. Nuclear morphology is a deep learning biomarker of cellular senescence and predicts clinical outcomes in cancer.Nature Aging, 2022. doi: 10.1038/ s43587-022-00263-3

  24. [24]

    DAB-DETR: Dynamic anchor boxes are better queries for DETR

    Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. InInternational Conference on Learning Representations (ICLR), 2022. URLhttps://arxiv.org/abs/2201.12329

  25. [25]

    Decoupled detr: Spatially disentangling localization and classification for improved end-to-end object detection

    Minghan Zhang et al. Decoupled detr: Spatially disentangling localization and classification for improved end-to-end object detection. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), page ..., 2023. URL https://openaccess.thecvf.com/content/ICCV2023/papers/Zhang_Decoupled_ DETR_Spatially_Disentangling_Localization_and_Cla...

  26. [26]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. InarXiv preprint arXiv:1807.03748, 2018

  27. [27]

    Cell-detr: Efficient cell detection and classification in wsis with transformers

    Oscar Pina et al. Cell-detr: Efficient cell detection and classification in wsis with transformers. InMedical Imaging with Deep Learning (MIDL), 2024

  28. [28]

    Tacco: a novel method for cell type annotation in spatial transcriptomics

    Maria Efremova et al. Tacco: a novel method for cell type annotation in spatial transcriptomics. Nature Methods, 2024. doi: 10.1038/s41592-024-02291-8. 11 A Supplementary A.1 Two training mode In thesequential trainingstrategy, predicted boxes from the detection decoder are directly passed to the representation decoder, allowing the model to learn detecti...