arxiv: 2406.09333 · v3 · pith:C5ELYSFQnew · submitted 2024-06-13 · 💻 cs.CV

Learning Spatial-Preserving Hierarchical Representations for Digital Pathology

Pith reviewed 2026-05-24 00:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords whole slide images · hierarchical representations · digital pathology · spatial preservation · multi-scale features · attention networks · slide classification · image segmentation

The pith

SPAN constructs multi-scale representations from single-scale inputs to preserve spatial context in whole slide images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sparse Pyramid Attention Networks (SPAN) to address the challenges of gigapixel whole slide images, which have sparse informative regions and intrinsic hierarchical structures. Existing methods often process patches independently or reshape them, losing spatial relationships. SPAN instead builds multi-scale features directly while allocating computation to key areas and maintaining spatial context. It applies this to slide classification via SPAN-MIL and segmentation via SPAN-UNet, with evaluations showing gains on public datasets. A reader would care because accurate modeling of these large medical images depends on respecting their natural pyramid organization rather than flattening it.

Core claim

SPAN is a hierarchical framework that preserves spatial relationships in WSIs by constructing multi-scale representations directly from single-scale inputs, enabling precise modeling of the intrinsic hierarchical pyramid structure without the distortions from independent patch processing or reshaping.
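
The "distortions from reshaping" named in the core claim can be made concrete with a toy measurement. The sketch below is not taken from the paper: it assumes a hypothetical baseline that packs the non-empty patches row-major into a near-square grid, and the helper names (`neighbor_pairs`, `squeeze_to_square`, `preserved_fraction`) are illustrative. It counts how many patch pairs that were adjacent on the slide are still adjacent after packing.

```python
import math

def neighbor_pairs(coords):
    """Index pairs (i, j) whose patches are right/down neighbours on the slide grid."""
    index = {c: i for i, c in enumerate(coords)}
    pairs = set()
    for (r, c), i in index.items():
        for dr, dc in ((0, 1), (1, 0)):
            j = index.get((r + dr, c + dc))
            if j is not None:
                pairs.add((i, j))
    return pairs

def squeeze_to_square(n):
    """Assumed baseline: pack n tokens row-major into a ceil(sqrt(n))-wide grid."""
    side = math.ceil(math.sqrt(n))
    return [(i // side, i % side) for i in range(n)]

def preserved_fraction(coords):
    """Fraction of true neighbour pairs that remain 4-neighbours after packing."""
    true_pairs = neighbor_pairs(coords)
    packed = squeeze_to_square(len(coords))
    kept = sum(
        1
        for i, j in true_pairs
        if abs(packed[i][0] - packed[j][0]) + abs(packed[i][1] - packed[j][1]) == 1
    )
    return kept / len(true_pairs) if true_pairs else 1.0

# A dense 4x4 block is already square, so packing changes nothing.
dense = [(r, c) for r in range(4) for c in range(4)]
# A 2x8 strip of tissue gets folded into a 4x4 grid.
strip = [(r, c) for r in range(2) for c in range(8)]
print(preserved_fraction(dense))
print(preserved_fraction(strip))
```

For the strip, 12 of the 22 original neighbour pairs survive the fold. This is only a toy distortion measure under the stated packing assumption, and a real evaluation would need the metric the authors choose.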

What carries the argument

Sparse Pyramid Attention Networks (SPAN), a hierarchical attention mechanism that allocates computation to informative regions while building pyramid representations from single-scale inputs.
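
One way to picture "building pyramid representations from single-scale inputs" is a sparse coarsening step that keeps grid coordinates attached to every token. The following is a minimal sketch under loud assumptions: average pooling stands in for whatever learned downsampling SPAN actually uses, the 2x scale factor is assumed, and `coarsen` and `build_pyramid` are hypothetical names, not the authors' API.

```python
from collections import defaultdict

def coarsen(level):
    """Pool a sparse level {(row, col): feature_vector} into the next coarser level.

    Each coarse cell (row // 2, col // 2) averages only the child cells that
    exist, so empty background never becomes a token. Mean pooling is an
    assumption made for illustration.
    """
    groups = defaultdict(list)
    for (r, c), feat in level.items():
        groups[(r // 2, c // 2)].append(feat)
    return {
        cell: [sum(vals) / len(feats) for vals in zip(*feats)]
        for cell, feats in groups.items()
    }

def build_pyramid(base, num_levels):
    """Return [base, coarse_1, coarse_2, ...] built from one single-scale input."""
    levels = [base]
    for _ in range(num_levels - 1):
        levels.append(coarsen(levels[-1]))
    return levels

# Three tissue patches at the finest scale; everything else is background.
base = {(0, 0): [1.0, 2.0], (0, 1): [3.0, 4.0], (5, 5): [10.0, 0.0]}
for depth, level in enumerate(build_pyramid(base, 3)):
    print(depth, sorted(level.items()))
```

Because coordinates are carried through each level, a coarse token still knows where on the slide it came from, which is the property the summary above attributes to SPAN.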

If this is right

  • SPAN-MIL improves slide-level classification accuracy by capturing contextual hierarchical relationships.
  • SPAN-UNet yields better patch-level segmentation by avoiding spatial distortions in feature construction.
  • Architectural inductive biases for hierarchy lead to measurable gains across multiple public pathology datasets.
  • Hierarchical representations support both classification and segmentation without requiring separate pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same single-to-multi-scale construction principle could reduce information loss in other domains with pyramid-like data, such as high-resolution satellite imagery.
  • Integrating SPAN-style attention with existing multiple-instance learning pipelines might improve efficiency when informative regions are extremely sparse.
  • Explicit pyramid preservation may lessen reliance on heavy data augmentation for training pathology models.

Load-bearing premise

Whole slide images possess intrinsic hierarchical pyramid representations that can be faithfully recovered by constructing multi-scale features directly from single-scale inputs.

What would settle it

A head-to-head test on a standard WSI dataset in which SPAN produces equal or lower accuracy than independent patch processing on both slide classification and patch segmentation tasks.

Figures

Figures reproduced from arXiv: 2406.09333 by Chongyang Gao, Chunhui Zhang, Jiang Gui, Siting Li, Weiyi Wu, Xingjian Diao, Xinwen Xu.

Figure 1: Comparison of our proposed hierarchical approach with conventional patch-based methods.
Figure 2: Schematic of the proposed sparse window attention mechanism. The input WSI is ...
Figure 3: The figure illustrates the overall architecture of the SPAN model. The input WSI first passes ...
Figure 4: Accuracy and memory usage of SPAN with varying window sizes from ...
Original abstract

Whole slide images (WSIs) pose fundamental computational challenges due to their gigapixel resolution and the sparse distribution of informative regions. Existing approaches often treat image patches independently or reshape them in ways that distort spatial context, thereby obscuring the hierarchical pyramid representations intrinsic to WSIs. We introduce Sparse Pyramid Attention Networks (SPAN), a hierarchical framework that preserves spatial relationships while allocating computation to informative regions. SPAN constructs multi-scale representations directly from single-scale inputs, enabling precise hierarchical modeling of WSI data. We demonstrate SPAN's versatility through two variants: SPAN-MIL for slide classification and SPAN-UNet for segmentation. Comprehensive evaluations across multiple public datasets show that SPAN effectively captures hierarchical structure and contextual relationships. Our results provide clear evidence that architectural inductive biases and hierarchical representations enhance both slide-level and patch-level performance. By addressing key computational challenges in WSI analysis, SPAN provides an effective framework for computational pathology and demonstrates important design principles for large-scale medical image analysis.
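
The abstract's point about "allocating computation to informative regions" can be illustrated with back-of-envelope arithmetic on attention cost. The sketch below assumes tokens are grouped into non-overlapping square windows by integer division of their grid coordinates, which is a simplification and not necessarily how the paper's sparse window attention assigns windows. `attention_pair_counts` is a hypothetical helper.

```python
from collections import Counter

def attention_pair_counts(coords, window):
    """Compare query-key pair counts for global vs. window-restricted attention.

    coords: (row, col) grid positions of the non-empty patches only.
    window: side length of the assumed non-overlapping square windows.
    Returns (global_pairs, windowed_pairs, non_empty_windows).
    """
    per_window = Counter((r // window, c // window) for r, c in coords)
    global_pairs = len(coords) ** 2
    windowed_pairs = sum(n * n for n in per_window.values())
    return global_pairs, windowed_pairs, len(per_window)

# Two 4x4 tissue islands far apart on an otherwise empty slide.
blob_a = [(r, c) for r in range(4) for c in range(4)]
blob_b = [(r + 100, c + 100) for r, c in blob_a]
coords = blob_a + blob_b

print(attention_pair_counts(coords, window=4))  # (1024, 512, 2)
print(attention_pair_counts(coords, window=2))  # (1024, 128, 8)
```

Only windows that contain tissue contribute any pairs, so the empty space between the two islands costs nothing. Smaller windows cut the pair count further at the price of less context per window.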

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Sparse Pyramid Attention Networks (SPAN), a hierarchical framework for whole slide images (WSIs) in digital pathology that constructs multi-scale representations directly from single-scale inputs to preserve spatial relationships and allocate computation to informative regions. It introduces two variants, SPAN-MIL for slide classification and SPAN-UNet for segmentation, and claims that evaluations on public datasets demonstrate improved capture of hierarchical structure and contextual relationships, providing evidence that architectural inductive biases enhance slide-level and patch-level performance.

Significance. If substantiated, the work could advance computational pathology by offering an attention-based approach to hierarchical WSI modeling that avoids spatial distortions from independent patch processing. The dual variants illustrate versatility across tasks, and the focus on inductive biases for large-scale medical images aligns with ongoing needs in the field.

major comments (2)
  1. [Abstract] The central claim that SPAN 'constructs multi-scale representations directly from single-scale inputs' to enable 'precise hierarchical modeling' of 'intrinsic' WSI pyramid representations lacks any quantitative check (e.g., feature alignment, ablation vs. explicit multi-resolution inputs, or distortion metrics). This assumption is load-bearing for attributing gains to faithful hierarchy recovery rather than attention or sparsity mechanisms alone.
  2. [Abstract] The assertions of 'comprehensive evaluations across multiple public datasets' and 'clear evidence' that hierarchical representations enhance performance are unsupported by any reported metrics, baselines, error bars, data splits, or statistical tests, leaving the primary empirical claim without visible grounding.
minor comments (1)
  1. [Abstract] The abstract refers to 'multiple public datasets' without naming them or describing splits, which would help assess the scope of the claimed improvements.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the abstract to ensure claims are more precisely grounded in the reported experiments.

Point-by-point responses
  1. Referee: [Abstract] The central claim that SPAN 'constructs multi-scale representations directly from single-scale inputs' to enable 'precise hierarchical modeling' of 'intrinsic' WSI pyramid representations lacks any quantitative check (e.g., feature alignment, ablation vs. explicit multi-resolution inputs, or distortion metrics). This assumption is load-bearing for attributing gains to faithful hierarchy recovery rather than attention or sparsity mechanisms alone.

    Authors: We agree that direct quantitative validation of the multi-scale construction would strengthen attribution of gains to the hierarchical inductive bias. The current experiments demonstrate performance benefits, but we will add an ablation comparing SPAN to explicit multi-resolution inputs along with feature alignment and distortion metrics in the revised manuscript. revision: yes

  2. Referee: [Abstract] The assertions of 'comprehensive evaluations across multiple public datasets' and 'clear evidence' that hierarchical representations enhance performance are unsupported by any reported metrics, baselines, error bars, data splits, or statistical tests, leaving the primary empirical claim without visible grounding.

    Authors: The full manuscript reports results across multiple public datasets with metrics, baselines, and data splits in the Experiments section. We will revise the abstract to reference these results more explicitly, incorporate error bars and statistical tests, and avoid overstatement of the evidence. revision: yes
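
The error bars and statistical tests discussed in this exchange could take many forms. One common option is a paired bootstrap over slides, sketched below. This is not a procedure the paper or the rebuttal specifies: the function name, the 95% level, the resample count, and the toy correctness vectors are all assumptions for illustration.

```python
import random

def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Bootstrap a 95% interval for accuracy(A) - accuracy(B) on the same slides.

    correct_a / correct_b: per-slide 0/1 correctness for two models, in the
    same slide order. Slides are resampled with replacement, keeping the pairing.
    Returns (point_estimate, lower, upper).
    """
    assert len(correct_a) == len(correct_b) and len(correct_a) > 0
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, lower, upper

# Toy data: model A is right on 80 of 100 slides, model B on 70 of them.
model_a = [1] * 80 + [0] * 20
model_b = [1] * 70 + [0] * 30
point, lower, upper = paired_bootstrap_diff(model_a, model_b)
print(f"accuracy gain {point:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

If the interval excludes zero, the gain is unlikely to be a resampling artifact on that test set. It says nothing about variation across training seeds or data splits, which would need separate runs.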

Circularity Check

0 steps flagged

No circularity: SPAN is an independent architectural proposal evaluated on external data

Full rationale

The paper introduces SPAN as a new hierarchical framework that constructs multi-scale representations from single-scale inputs via attention-based mechanisms. No derivation step reduces by construction to fitted parameters, self-defined quantities, or load-bearing self-citations; the central claims rest on empirical evaluations across public datasets rather than on any renaming, ansatz smuggling, or uniqueness theorem imported from the authors' prior work. The assumption that WSIs possess recoverable intrinsic pyramid hierarchies is presented as a modeling premise, not as a result derived from the model's own outputs. This is the most common honest finding for an architectural proposal paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the design of a new attention-based hierarchical architecture whose specific layer choices, attention mechanisms, and scale-construction rules are introduced by the authors; these constitute free parameters in the model definition. No invented physical entities are postulated.

free parameters (1)
  • pyramid scale factors and attention head counts
    Design choices in the multi-scale construction and attention modules that are selected to enable the hierarchical modeling.
axioms (1)
  • domain assumption: WSIs contain intrinsic hierarchical pyramid representations that are obscured by independent patch processing
    Invoked in the abstract to motivate the need for spatial-preserving multi-scale construction.

pith-pipeline@v0.9.0 · 5711 in / 1317 out tokens · 20564 ms · 2026-05-24T00:18:38.121671+00:00 · methodology


