Learning Spatial-Preserving Hierarchical Representations for Digital Pathology

Chongyang Gao; Chunhui Zhang; Jiang Gui; Siting Li; Weiyi Wu; Xingjian Diao; Xinwen Xu

arxiv: 2406.09333 · v3 · pith:C5ELYSFQnew · submitted 2024-06-13 · 💻 cs.CV

Weiyi Wu, Xingjian Diao, Chunhui Zhang, Chongyang Gao, Xinwen Xu, Siting Li, Jiang Gui

Pith reviewed 2026-05-24 00:18 UTC · model grok-4.3

classification 💻 cs.CV

keywords: whole slide images, hierarchical representations, digital pathology, spatial preservation, multi-scale features, attention networks, slide classification, image segmentation

The pith

SPAN constructs multi-scale representations from single-scale inputs to preserve spatial context in whole slide images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sparse Pyramid Attention Networks (SPAN) to address the challenges of gigapixel whole slide images, which have sparse informative regions and intrinsic hierarchical structures. Existing methods often process patches independently or reshape them, losing spatial relationships. SPAN instead builds multi-scale features directly while allocating computation to key areas and maintaining spatial context. It applies this to slide classification via SPAN-MIL and segmentation via SPAN-UNet, with evaluations showing gains on public datasets. A reader would care because accurate modeling of these large medical images depends on respecting their natural pyramid organization rather than flattening it.

Core claim

SPAN is a hierarchical framework that preserves spatial relationships in WSIs by constructing multi-scale representations directly from single-scale inputs, enabling precise modeling of the intrinsic hierarchical pyramid structure without the distortions from independent patch processing or reshaping.

What carries the argument

Sparse Pyramid Attention Networks (SPAN), a hierarchical attention mechanism that allocates computation to informative regions while building pyramid representations from single-scale inputs.
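
As a rough illustration of building coarser levels from a single-scale input, the sketch below mean-pools sparse patch features into progressively coarser grids. It is a minimal sketch under stated assumptions, not the authors' implementation: `downsample_sparse` and `build_pyramid` are hypothetical names, the coordinates are assumed to be integer grid positions of tissue patches, and plain mean pooling stands in for whatever learned downsampling SPAN actually uses.

```python
from collections import defaultdict


def downsample_sparse(coords, feats, factor=2):
    """Mean-pool sparse patch features into a coarser grid.

    coords: list of (row, col) integer grid positions of tissue patches
    feats:  list of equal-length feature vectors (lists of floats)
    Assumption: mean pooling is a stand-in for a learned downsampling layer.
    """
    buckets = defaultdict(list)
    for (r, c), f in zip(coords, feats):
        # Patches whose positions share a coarse cell are pooled together.
        buckets[(r // factor, c // factor)].append(f)
    coarse_coords = sorted(buckets)
    coarse_feats = [
        [sum(vals) / len(vals) for vals in zip(*buckets[key])]
        for key in coarse_coords
    ]
    return coarse_coords, coarse_feats


def build_pyramid(coords, feats, levels=3, factor=2):
    """Return [(coords, feats), ...] from the finest to the coarsest level."""
    pyramid = [(coords, feats)]
    for _ in range(levels - 1):
        coords, feats = downsample_sparse(coords, feats, factor)
        pyramid.append((coords, feats))
    return pyramid


# Five tissue patches on an otherwise empty grid (toy values).
coords = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
feats = [[1.0, 0.0], [3.0, 0.0], [2.0, 3.0], [0.0, 4.0], [0.0, 2.0]]
pyramid = build_pyramid(coords, feats, levels=3)
for level, (level_coords, _) in enumerate(pyramid):
    print(level, level_coords)
```

Only occupied cells are stored at every level, so empty background never enters the computation, and each coarse cell keeps the grid position of the patches it summarizes.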

If this is right

SPAN-MIL improves slide-level classification accuracy by capturing contextual hierarchical relationships.
SPAN-UNet yields better patch-level segmentation by avoiding spatial distortions in feature construction.
Architectural inductive biases for hierarchy lead to measurable gains across multiple public pathology datasets.
Hierarchical representations support both classification and segmentation without requiring separate pipelines.
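
For slide-level classification, the patch or region features eventually have to be reduced to a single slide vector. A common aggregator in multiple-instance learning is attention pooling, sketched below in plain Python. Whether SPAN-MIL uses a head of this kind is an assumption; this page does not specify it. `attention_pool` and the scoring vector `w` are hypothetical, and the learned projection that normally sits inside `tanh` is simplified to the identity.

```python
import math


def attention_pool(feats, w):
    """Softmax-weighted average of instance features.

    feats: list of equal-length feature vectors (one per patch or region)
    w:     hypothetical scoring vector; score_i = w . tanh(f_i)
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(wj * math.tanh(fj) for wj, fj in zip(w, f)) for f in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(feats[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, feats)) for d in range(dim)]
    return pooled, weights


feats = [[1.0, 0.0], [3.0, 2.0]]
# With an all-zero scorer every instance gets the same weight,
# so the pooled vector reduces to the plain mean.
pooled_uniform, weights_uniform = attention_pool(feats, [0.0, 0.0])
# A non-zero scorer shifts weight toward higher-scoring instances.
pooled_scored, weights_scored = attention_pool(feats, [1.0, 1.0])
```

The weights sum to one, so the slide vector stays a convex combination of instance features and the weights themselves can be inspected as a rough map of which regions drove the prediction.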

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-to-multi-scale construction principle could reduce information loss in other domains with pyramid-like data, such as high-resolution satellite imagery.
Integrating SPAN-style attention with existing multiple-instance learning pipelines might improve efficiency when informative regions are extremely sparse.
Explicit pyramid preservation may lessen reliance on heavy data augmentation for training pathology models.

Load-bearing premise

Whole slide images possess intrinsic hierarchical pyramid representations that can be faithfully recovered by constructing multi-scale features directly from single-scale inputs.
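
One way to probe this premise, assuming features extracted from real lower-magnification patches are available in the same embedding space as the constructed coarse features, is a per-location similarity score. The sketch below is illustrative only: `mean_alignment` is a hypothetical helper, matching by grid coordinate is an assumption, and the toy vectors are not data from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_alignment(constructed, reference):
    """Average cosine similarity over coordinates present in both maps.

    constructed: {(row, col): vector} built from single-scale inputs
    reference:   {(row, col): vector} from real lower-magnification patches
    """
    shared = sorted(set(constructed) & set(reference))
    if not shared:
        raise ValueError("no shared coordinates to compare")
    return sum(cosine(constructed[k], reference[k]) for k in shared) / len(shared)


constructed = {(0, 0): [1.0, 0.0], (1, 1): [0.0, 2.0], (2, 2): [1.0, 1.0]}
reference = {(0, 0): [2.0, 0.0], (1, 1): [3.0, 0.0]}
score = mean_alignment(constructed, reference)
print(score)
```

If the two feature sets come from different encoders, a learned linear map between the spaces would be needed before a score like this means anything.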

What would settle it

A head-to-head test on a standard WSI dataset in which SPAN produces equal or lower accuracy than independent patch processing on both slide classification and patch segmentation tasks.

Figures

Figures reproduced from arXiv: 2406.09333 by Chongyang Gao, Chunhui Zhang, Jiang Gui, Siting Li, Weiyi Wu, Xingjian Diao, Xinwen Xu.

**Figure 2.** Schematic of the proposed sparse window attention mechanism. The input WSI is ...

**Figure 3.** The figure illustrates the overall architecture of the SPAN model. The input WSI first passes ...

**Figure 4.** Accuracy and memory usage of SPAN with varying window sizes from ...
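
Figure 2 refers to sparse window attention, and passages quoted elsewhere on this page mention shifted windows. A minimal sketch of the grouping step such a mechanism needs is given below: sparse patch coordinates are bucketed into local windows, and a shifted partition lets neighbors that straddle a window boundary share a window. The function name `window_groups`, the window size, and the shift value are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict


def window_groups(coords, window=4, shift=0):
    """Group sparse patch indices by the local window they fall into.

    coords: list of (row, col) integer grid positions of tissue patches
    window: side length of a square window, in patches (assumed value)
    shift:  offset applied before bucketing, e.g. window // 2
    Only non-empty windows appear in the result.
    """
    groups = defaultdict(list)
    for idx, (r, c) in enumerate(coords):
        key = ((r + shift) // window, (c + shift) // window)
        groups[key].append(idx)
    return dict(groups)


def attention_pairs(groups):
    """Number of query-key pairs if attention runs inside each window only."""
    return sum(len(members) ** 2 for members in groups.values())


coords = [(0, 0), (1, 3), (3, 3), (4, 4), (9, 9)]
regular = window_groups(coords, window=4, shift=0)
shifted = window_groups(coords, window=4, shift=2)
print(regular)
print(shifted)
print(attention_pairs(regular), "pairs versus", len(coords) ** 2, "for full attention")
```

In the regular partition, patches 2 and 3 sit on opposite sides of a window boundary; in the shifted partition they land in the same window. Alternating the two partitions across layers is the usual way such designs pass information between neighboring windows.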

Original abstract

Whole slide images (WSIs) pose fundamental computational challenges due to their gigapixel resolution and the sparse distribution of informative regions. Existing approaches often treat image patches independently or reshape them in ways that distort spatial context, thereby obscuring the hierarchical pyramid representations intrinsic to WSIs. We introduce Sparse Pyramid Attention Networks (SPAN), a hierarchical framework that preserves spatial relationships while allocating computation to informative regions. SPAN constructs multi-scale representations directly from single-scale inputs, enabling precise hierarchical modeling of WSI data. We demonstrate SPAN's versatility through two variants: SPAN-MIL for slide classification and SPAN-UNet for segmentation. Comprehensive evaluations across multiple public datasets show that SPAN effectively captures hierarchical structure and contextual relationships. Our results provide clear evidence that architectural inductive biases and hierarchical representations enhance both slide-level and patch-level performance. By addressing key computational challenges in WSI analysis, SPAN provides an effective framework for computational pathology and demonstrates important design principles for large-scale medical image analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SPAN introduces a spatial-preserving way to build multi-scale features for WSIs from single-scale patches, but the abstract shows no numbers or checks on whether the hierarchy is actually recovered.

The paper's main contribution is SPAN, which uses attention to construct pyramid-style multi-scale representations directly from single-scale WSI patches instead of processing them independently or reshaping them. The authors present two versions: SPAN-MIL for slide-level MIL classification and SPAN-UNet for UNet-style segmentation. The design targets gigapixel images with sparse informative regions and aims to keep spatial relationships intact across scales.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Sparse Pyramid Attention Networks (SPAN), a hierarchical framework for whole slide images (WSIs) in digital pathology that constructs multi-scale representations directly from single-scale inputs to preserve spatial relationships and allocate computation to informative regions. It introduces two variants, SPAN-MIL for slide classification and SPAN-UNet for segmentation, and claims that evaluations on public datasets demonstrate improved capture of hierarchical structure and contextual relationships, providing evidence that architectural inductive biases enhance slide-level and patch-level performance.

Significance. If substantiated, the work could advance computational pathology by offering an attention-based approach to hierarchical WSI modeling that avoids spatial distortions from independent patch processing. The dual variants illustrate versatility across tasks, and the focus on inductive biases for large-scale medical images aligns with ongoing needs in the field.

major comments (2)

[Abstract] The central claim that SPAN 'constructs multi-scale representations directly from single-scale inputs' to enable 'precise hierarchical modeling' of 'intrinsic' WSI pyramid representations lacks any quantitative check (e.g., feature alignment, ablation vs. explicit multi-resolution inputs, or distortion metrics). This assumption is load-bearing for attributing gains to faithful hierarchy recovery rather than attention or sparsity mechanisms alone.
[Abstract] The assertions of 'comprehensive evaluations across multiple public datasets' and 'clear evidence' that hierarchical representations enhance performance are unsupported by any reported metrics, baselines, error bars, data splits, or statistical tests, leaving the primary empirical claim without visible grounding.
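
The error bars and statistical tests requested above could take several forms. One simple option, sketched here under stated assumptions, is a paired bootstrap over test slides for the accuracy difference between two models. `paired_bootstrap_diff` is a hypothetical helper, and the per-slide correctness flags are toy values rather than results from the paper.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_boot=2000, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B).

    correct_a, correct_b: lists of 0/1 flags, one per test slide, in the
    same slide order for both models (paired design).
    Returns (point_estimate, lower_bound, upper_bound) for a 95% interval.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same slides")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        # Resample slides with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, lower, upper


# Toy example: model A is right on 18 of 20 slides, model B on 15 of 20.
correct_a = [1] * 18 + [0] * 2
correct_b = [1] * 15 + [0] * 5
point, lower, upper = paired_bootstrap_diff(correct_a, correct_b)
print(point, lower, upper)
```

Resampling slides rather than patches keeps the unit of analysis at the level where labels are independent, which matters when many patches come from the same slide.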

minor comments (1)

[Abstract] The abstract refers to 'multiple public datasets' without naming them or describing splits, which would help assess the scope of the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the abstract to ensure claims are more precisely grounded in the reported experiments.

Referee: [Abstract] The central claim that SPAN 'constructs multi-scale representations directly from single-scale inputs' to enable 'precise hierarchical modeling' of 'intrinsic' WSI pyramid representations lacks any quantitative check (e.g., feature alignment, ablation vs. explicit multi-resolution inputs, or distortion metrics). This assumption is load-bearing for attributing gains to faithful hierarchy recovery rather than attention or sparsity mechanisms alone.

Authors: We agree that direct quantitative validation of the multi-scale construction would strengthen attribution of gains to the hierarchical inductive bias. The current experiments demonstrate performance benefits, but we will add an ablation comparing SPAN to explicit multi-resolution inputs along with feature alignment and distortion metrics in the revised manuscript.

revision: yes

Referee: [Abstract] The assertions of 'comprehensive evaluations across multiple public datasets' and 'clear evidence' that hierarchical representations enhance performance are unsupported by any reported metrics, baselines, error bars, data splits, or statistical tests, leaving the primary empirical claim without visible grounding.

Authors: The full manuscript reports results across multiple public datasets with metrics, baselines, and data splits in the Experiments section. We will revise the abstract to reference these results more explicitly, incorporate error bars and statistical tests, and avoid overstatement of the evidence.

revision: yes

Circularity Check

0 steps flagged

No circularity: SPAN is an independent architectural proposal evaluated on external data

The paper introduces SPAN as a new hierarchical framework that constructs multi-scale representations from single-scale inputs via attention-based mechanisms. No derivation step reduces by construction to fitted parameters, self-defined quantities, or load-bearing self-citations; the central claims rest on empirical evaluations across public datasets rather than on any renaming, ansatz smuggling, or uniqueness theorem imported from the authors' prior work. The assumption that WSIs possess recoverable intrinsic pyramid hierarchies is presented as a modeling premise, not as a result derived from the model's own outputs. This is the most common honest finding for an architectural proposal paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the design of a new attention-based hierarchical architecture whose specific layer choices, attention mechanisms, and scale-construction rules are introduced by the authors; these constitute free parameters in the model definition. No invented physical entities are postulated.

free parameters (1)

pyramid scale factors and attention head counts: design choices in the multi-scale construction and attention modules that are selected to enable the hierarchical modeling.

axioms (1)

domain assumption: WSIs contain intrinsic hierarchical pyramid representations that are obscured by independent patch processing. Invoked in the abstract to motivate the need for spatial-preserving multi-scale construction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing)

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: SPAN constructs multi-scale representations directly from single-scale inputs... sparse pyramid attention architecture that hierarchically focuses on informative regions... shifted windows... downsampling by a factor of approximately 4

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness)

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: The ablation study results... importance of the shifted-window mechanism and hierarchical downsampling through convolutional layers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the digital pathology association

Esther Abels, Liron Pantanowitz, Famke Aeffner, Mark D Zarella, Jeroen van der Laak, Mar- ilyn M Bui, Venkata NP Vemuri, Anil V Parwani, Jeff Gibbs, Emmanuel Agosto-Arroyo, et al. Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the digital pathology association. The Journal of pathology,...

work page 2019

[2] [2]

Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer

Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes Van Diest, Bram Van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen AWM Van Der Laak, Meyke Hermsen, Quirine F Manson, Maschenka Balkenhol, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22):2199–2210, 2017

work page 2017

[3] [3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[4] [4]

Bracs: A dataset for breast carcinoma subtyping in h&e histology images

Nadia Brancati, Anna Maria Anniciello, Pushpak Pati, Daniel Riccio, Giosuè Scognamiglio, Guillaume Jaume, Giuseppe De Pietro, Maurizio Di Bonito, Antonio Foncubierta, Gerardo Botti, et al. Bracs: A dataset for breast carcinoma subtyping in h&e histology images. Database, 2022:baac093, 2022

work page 2022

[5] [5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901

[6] [6]

Clinical-grade computational pathology using weakly supervised deep learning on whole slide images

Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine, 25(8):1301–1309, 2019

work page 2019

[7] [7]

Spconv: Spatially sparse convolution library

Spconv Contributors. Spconv: Spatially sparse convolution library. https://github.com/ traveller59/spconv, 2022

work page 2022

[8] [8]

Transformer-XL: Attentive language models beyond a fixed-length context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 2978–

work page

[9] [9]

doi: 10.18653/v1/P19-1285

Association for Computational Linguistics, 2019. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285

work page doi:10.18653/v1/p19-1285 2019

[10] [10]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2dnO3LLiJ1

[11]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[12]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https...

[13]

FasterViT: Fast vision transformers with hierarchical attention

Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, and Pavlo Molchanov. FasterViT: Fast vision transformers with hierarchical attention. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=kB4yBiNmXX

[14]

Spatial pyramid pooling in deep convolutional networks for visual recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015

[15]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

[16]

Rotary position embedding for vision transformer

Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. arXiv preprint arXiv:2403.13298, 2024

[17]

Attention-based deep multiple instance learning

Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International conference on machine learning, pages 2127–2136. PMLR, 2018

[18]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

[19]

Imagenet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b...

[20]

Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning

Bin Li, Yin Li, and Kevin W Eliceiri. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14318–14328, 2021

[21]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

[22]

Sparse convolutional neural networks

Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 806–814, 2015

[23]

Learning to encode position for transformer with continuous dynamical model

Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. Learning to encode position for transformer with continuous dynamical model. In International conference on machine learning, pages 6327–6335. PMLR, 2020

[24]

RoBERTa: A robustly optimized BERT pretraining approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach, 2020. URL https://openreview.net/forum?id=SyxS0T4tvS

[25]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

[26]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

[27]

Distinctive image features from scale-invariant keypoints

David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004

[28]

Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023

[29]

Data-efficient and weakly supervised computational pathology on whole-slide images

Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering, 5(6):555–570, 2021

[30]

Train short, test long: Attention with linear biases enables input length extrapolation

Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations. URL https://openreview.net/forum?id=R8sQPpGCv0

[32]

Transmil: Transformer based correlated multiple instance learning for whole slide image classification

Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. Advances in neural information processing systems, 34:2136–2147, 2021

[33]

Self-attention with relative position representations

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, New Orleans, Louisiana, 2018. Association for Computational Ling...

doi: 10.18653/v1/n18-2074

[34]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

[35]

Multiple instance learning framework with masked hard instance mining for whole slide image classification

Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, and Bo Liu. Multiple instance learning framework with masked hard instance mining for whole slide image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4078–4087, 2023

[36]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. UR...

[37]

Deep high-resolution representation learning for visual recognition

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 43(10): 3349–3364, 2020

[38]

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019

[39]

Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021

[40]

Nyströmformer: A nyström-based algorithm for approximating self-attention

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14138–14148, 2021

[41]

Focal attention for long-range interactions in vision transformers

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal attention for long-range interactions in vision transformers. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 30008–30022. Curran Associates, Inc., 2...

[42]

Big bird: Transformers for longer sequences

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33: 17283–17297, 2020

[43]

Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification

Hongrun Zhang, Yanda Meng, Yitian Zhao, Yihong Qiao, Xiaoyun Yang, Sarah E Coupland, and Yalin Zheng. Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18802–18812, 2022
