arxiv: 2604.05718 · v1 · submitted 2026-04-07 · 💻 cs.CV

MPM: Mutual Pair Merging for Efficient Vision Transformers

Pith reviewed 2026-05-10 19:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords: token reduction, vision transformer, semantic segmentation, mutual nearest neighbor, training-free, inference acceleration, ADE20K, latency measurement

The pith

Mutual Pair Merging shortens vision transformer sequences for semantic segmentation by averaging mutual nearest-neighbor token pairs while preserving reconstruction for existing decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that token reduction in vision transformers can deliver real end-to-end latency improvements for semantic segmentation when the reduction method accounts for reconstruction needs and computational overhead. It establishes this through Mutual Pair Merging, which pairs tokens that are mutual nearest neighbors in cosine space, averages the pairs to reduce sequence length, and keeps a merge map for later gather-based recovery of the full feature map. This approach requires no training or extra parameters, and the compression level is set by choosing where to insert the module. On ADE20K, it yields up to 60 percent lower per-image latency for ViT-Tiny on Raspberry Pi 5 and up to 20 percent higher throughput on H100 with FlashAttention-2, while keeping the mIoU drop below 3 percent. Such results indicate that simple pairing strategies can make acceleration practical for dense prediction tasks where prior methods fell short on wall-clock metrics.

Core claim

MPM forms mutual nearest-neighbor pairs in cosine space, averages each pair to shorten the token sequence processed by the transformer, and records a merge map that permits gather-based reconstruction of the original-resolution features immediately before the segmentation decoder, allowing any existing head to be used without modification or retraining.
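
The averaging and gather steps in this claim can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `merge_pairs` and `reconstruct`, the ordering of merged tokens, and the list-of-lists token layout are all assumptions made for readability. The pairs are taken as given here and are assumed to be disjoint.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def merge_pairs(tokens: Sequence[Vector],
                pairs: Sequence[Tuple[int, int]]) -> Tuple[List[Vector], List[int]]:
    """Average each (i, j) pair and return (merged_tokens, merge_map).

    merge_map[k] is the position in the merged sequence that original
    token k was folded into. Pairs are assumed to be disjoint.
    """
    partner: Dict[int, int] = {}
    for i, j in pairs:
        partner[i] = j
        partner[j] = i

    merge_map = [-1] * len(tokens)
    merged: List[Vector] = []
    for k, tok in enumerate(tokens):
        if merge_map[k] != -1:
            continue  # already folded into an earlier pair
        slot = len(merged)
        if k in partner:
            other = tokens[partner[k]]
            merged.append([(a + b) / 2.0 for a, b in zip(tok, other)])
            merge_map[partner[k]] = slot
        else:
            merged.append(list(tok))  # unpaired tokens pass through unchanged
        merge_map[k] = slot
    return merged, merge_map


def reconstruct(merged: Sequence[Vector], merge_map: Sequence[int]) -> List[Vector]:
    """Gather: every original position reads back its merged token."""
    return [list(merged[slot]) for slot in merge_map]


tokens = [[1.0, 0.0], [0.0, 5.0], [3.0, 0.0]]
merged, merge_map = merge_pairs(tokens, [(0, 2)])
# merged == [[2.0, 0.0], [0.0, 5.0]] and merge_map == [0, 1, 0]
full = reconstruct(merged, merge_map)  # three tokens again, one per position
```

Because reconstruction is a pure index lookup, the decoder sees a sequence of the original length, which is why an existing segmentation head would not need to change under this scheme.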

What carries the argument

Mutual nearest-neighbor pairing in cosine similarity space that produces pairs where each token is the nearest neighbor of its partner, combined with the recorded merge map enabling gather-based reconstruction.
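
To make the pairing rule concrete, here is a small sketch of mutual nearest-neighbor selection under cosine similarity. It is written in plain Python with a quadratic loop for clarity; the names `cosine` and `mutual_nearest_pairs` and the first-index tie-breaking are assumptions for illustration, and a practical version would presumably use batched tensor operations.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def mutual_nearest_pairs(tokens: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return pairs (i, j), i < j, where each token is the other's nearest neighbor."""
    n = len(tokens)
    if n < 2:
        return []
    nearest: List[int] = []
    for i in range(n):
        best_j, best_sim = -1, -math.inf
        for j in range(n):
            if j == i:
                continue
            sim = cosine(tokens[i], tokens[j])
            if sim > best_sim:  # strict '>' keeps the first index on ties
                best_j, best_sim = j, sim
        nearest.append(best_j)
    return [(i, j) for i, j in enumerate(nearest) if i < j and nearest[j] == i]


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
pairs = mutual_nearest_pairs(tokens)
# pairs == [(0, 1), (2, 3)]; token 4 has no mutual partner and stays unmerged
```

Since each token has exactly one nearest neighbor under a deterministic tie-break, a token can appear in at most one mutual pair, so the pairs are disjoint without any extra conflict resolution. The number of pairs is determined by the data rather than by a keep-rate, which is consistent with the abstract's statement that there is no continuous compression knob.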

Load-bearing premise

The time required to identify mutual nearest-neighbor pairs and to perform the subsequent gather reconstruction does not outweigh the computational savings from processing shorter sequences.
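
A rough back-of-the-envelope model shows how this premise could fail. The sketch below counts multiply-accumulate operations for a standard transformer block and for a full pairwise similarity matrix. It is only a proxy of the kind the paper itself cautions against: it ignores the gather step, normalization, memory traffic, and kernel effects such as FlashAttention. The function names, the MLP ratio of 4, and the example token counts are assumptions, not figures from the paper.

```python
def block_macs(n: int, d: int, mlp_ratio: int = 4) -> int:
    """Approximate MACs for one transformer block with n tokens of width d."""
    proj = 4 * n * d * d              # Q, K, V and output projections
    attn = 2 * n * n * d              # QK^T scores plus weighted sum over V
    mlp = 2 * mlp_ratio * n * d * d   # two linear layers of the MLP
    return proj + attn + mlp


def similarity_macs(n: int, d: int) -> int:
    """MACs for a dense n-by-n matrix of token dot products."""
    return n * n * d


def net_saving(n_before: int, n_after: int, d: int,
               blocks_after: int, mlp_ratio: int = 4) -> int:
    """MACs saved in the blocks after one merge, minus the pairing cost.

    A negative value means the merge costs more than it saves in this model.
    """
    per_block = block_macs(n_before, d, mlp_ratio) - block_macs(n_after, d, mlp_ratio)
    return blocks_after * per_block - similarity_macs(n_before, d)


# Hypothetical setting: 1024 tokens of width 192, reduced to 768 by one merge.
early = net_saving(1024, 768, 192, blocks_after=6)  # positive: merge pays off
late = net_saving(1024, 768, 192, blocks_after=0)   # negative: pure overhead
```

In this model a merge inserted with no blocks left to run is pure overhead, and one inserted early amortizes its pairing cost over many shorter blocks. Whether the same ordering holds in wall-clock time is exactly what the premise leaves to measurement.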

What would settle it

A direct timing experiment on the reported hardware and models in which adding MPM increases rather than decreases total inference latency.
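
Such an experiment could be organized along the following lines. This is a generic wall-clock harness using only the Python standard library; `measure_latency`, `relative_change`, and the placeholder workloads are hypothetical names, and the warm-up and run counts are arbitrary. On a GPU, the device would also need to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch), which this sketch omits.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(fn: Callable[[], object],
                    warmup: int = 3, runs: int = 20) -> List[float]:
    """Return per-call wall-clock times in seconds after some warm-up calls."""
    for _ in range(warmup):
        fn()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def relative_change(baseline: List[float], candidate: List[float]) -> float:
    """Median candidate latency relative to the baseline, minus one.

    Positive means the candidate is slower than the baseline.
    """
    return statistics.median(candidate) / statistics.median(baseline) - 1.0


def baseline_model() -> int:
    # Placeholder workload standing in for the unmodified network.
    return sum(i * i for i in range(20000))


def merged_model() -> int:
    # Placeholder workload standing in for the network with merging inserted.
    return sum(i * i for i in range(12000))


change = relative_change(measure_latency(baseline_model),
                         measure_latency(merged_model))
# A positive `change` on the reported hardware would contradict the claim.
```

Using the median rather than the mean keeps a few slow outlier runs from dominating the comparison, which matters on small devices with background load.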

Figures

Figures reproduced from arXiv: 2604.05718 by David Rousseau, Pejman Rasti, Simon Ravé.

Figure 1: Visual abstract of Mutual Pair Merging (MPM). Similar tokens are matched by mutual pairs and averaged together. Tokens without ...
Figure 2: Visualization of MPM on the same image during daytime ...
Figure 3: Accuracy vs. FPS for MPM using different insertion ...
Abstract

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mutual Pair Merging (MPM), a training-free token aggregation module for vision transformers in semantic segmentation. It identifies mutual nearest-neighbor pairs in cosine space, averages each pair, records a merge map, and performs gather-based reconstruction before the decoder so that existing heads can be used unchanged. No learned parameters or continuous compression knobs are introduced; the speed-accuracy trade-off is controlled solely by a discrete insertion schedule. End-to-end latency and throughput are measured on NVIDIA H100 (with and without FlashAttention-2) and Raspberry Pi 5 across standard segmentation datasets, with the central claim that MPM yields up to 60% per-image latency reduction for ViT-Tiny on the Raspberry Pi 5 and up to 20% throughput increase on H100 while keeping the mIoU drop below 3%.

Significance. If the reported net gains hold after full accounting of overhead, the work provides concrete evidence that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock improvements for dense prediction on both accelerators and edge hardware. This addresses a documented limitation in prior token-reduction literature, which often relies on proxy metrics or classification-only settings and rarely reports hardware-measured end-to-end latency for segmentation.

major comments (2)
  1. [Experiments / Latency evaluation] The central latency claims (up to 60% reduction on Raspberry Pi 5 for ViT-Tiny and 20% throughput gain on H100) are load-bearing and rest on the assumption that mutual nearest-neighbor pair computation plus merge-map construction and gather reconstruction impose negligible overhead. No component-wise timing breakdown (merge step versus attention versus decoder) is supplied on either target platform, despite the abstract explicitly noting that such overheads have erased gains in prior work.
  2. [Method / Insertion schedule] The discrete insertion schedule is presented as the sole mechanism for controlling the trade-off, yet the manuscript provides insufficient detail on its concrete implementation (e.g., which layers receive merges, how many pairs are formed per insertion point, and whether the schedule is dataset- or model-specific). This information is required to reproduce the reported mIoU/latency points and to assess the claim that the method is fully parameter-free.
minor comments (2)
  1. [Abstract] The abstract states throughput increases 'by up to 20%' without specifying the exact baseline configuration (e.g., whether FlashAttention-2 is enabled in the baseline).
  2. [Tables and Figures] Figure captions and table footnotes should explicitly state the number of runs or seeds used for the reported latency and mIoU numbers.
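
The component-wise breakdown requested in major comment 1 could be collected with a small accumulator like the one below. This is a sketch under assumptions: `StageTimer` and the stage names are hypothetical, the timed bodies are placeholders, and accelerator code would additionally need device synchronization around each stage for the numbers to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named stage across many calls."""

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> Dict[str, float]:
        """Fraction of the total measured time spent in each stage."""
        total = sum(self.totals.values())
        if total == 0.0:
            return {name: 0.0 for name in self.totals}
        return {name: t / total for name, t in self.totals.items()}


timer = StageTimer()
for _ in range(10):  # repeated runs, as minor comment 2 asks to be reported
    with timer.stage("pair_search"):
        sum(i * i for i in range(2000))   # placeholder for the pairing step
    with timer.stage("attention"):
        sum(i * i for i in range(8000))   # placeholder for transformer blocks
    with timer.stage("reconstruct"):
        sum(i for i in range(500))        # placeholder for the gather step
breakdown = timer.shares()
```

Reporting the resulting shares per platform would show directly whether the pairing and reconstruction stages stay small relative to attention.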

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical value of MPM for end-to-end latency improvements in semantic segmentation. We address each major comment below and will revise the manuscript to incorporate the requested details and breakdowns.

  1. Referee: [Experiments / Latency evaluation] The central latency claims (up to 60% reduction on Raspberry Pi 5 for ViT-Tiny and 20% throughput gain on H100) are load-bearing and rest on the assumption that mutual nearest-neighbor pair computation plus merge-map construction and gather reconstruction impose negligible overhead. No component-wise timing breakdown (merge step versus attention versus decoder) is supplied on either target platform, despite the abstract explicitly noting that such overheads have erased gains in prior work.

    Authors: We agree that a component-wise timing breakdown is necessary to substantiate the net gains and to address the overhead concerns raised in the abstract. In the revised manuscript we will add explicit latency breakdowns (in tables and/or figures) separating the mutual nearest-neighbor search, merge-map construction, token averaging, attention computation, and gather-based reconstruction on both the H100 (with and without FlashAttention-2) and Raspberry Pi 5. These measurements will be obtained from the same experimental setup used for the reported end-to-end figures and will demonstrate that MPM overhead remains small relative to the attention savings. revision: yes

  2. Referee: [Method / Insertion schedule] The discrete insertion schedule is presented as the sole mechanism for controlling the trade-off, yet the manuscript provides insufficient detail on its concrete implementation (e.g., which layers receive merges, how many pairs are formed per insertion point, and whether the schedule is dataset- or model-specific). This information is required to reproduce the reported mIoU/latency points and to assess the claim that the method is fully parameter-free.

    Authors: We acknowledge that additional implementation details are required for full reproducibility. The insertion schedule is model-specific (chosen empirically per architecture such as ViT-Tiny to meet the target accuracy-latency operating point) but contains no learned parameters and is independent of the dataset. In the revision we will expand the method section with (i) the exact layer indices at which merges occur, (ii) the number of pairs merged at each insertion point for the reported configurations, and (iii) a brief description of the empirical selection procedure. This information will also be summarized in a table and accompanied by pseudocode in the supplementary material. revision: yes
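
One way to picture an insertion schedule is as a set of block indices, from which bounds on the sequence length at each block follow. The sketch below is illustrative only: it assumes a merge runs immediately before each listed block and performs a single round of pairwise merging, neither of which is confirmed by the source, and the function name and example schedule are hypothetical.

```python
from typing import Iterable, List, Tuple


def token_count_bounds(n_tokens: int, n_blocks: int,
                       schedule: Iterable[int]) -> List[Tuple[int, int]]:
    """Per-block (min, max) sequence length for a given insertion schedule.

    Assumes one round of pairwise merging before each block in `schedule`.
    The minimum is reached when every token finds a mutual partner, the
    maximum when none does; real counts depend on the image content.
    """
    insert_at = set(schedule)
    lo, hi = n_tokens, n_tokens
    bounds: List[Tuple[int, int]] = []
    for block in range(n_blocks):
        if block in insert_at:
            lo = lo - lo // 2  # at most floor(lo / 2) pairs can form
        bounds.append((lo, hi))
    return bounds


bounds = token_count_bounds(1024, 4, schedule=[1, 3])
# bounds == [(1024, 1024), (512, 1024), (512, 1024), (256, 1024)]
```

Under these assumptions each insertion can at most halve the sequence, so the schedule sets how aggressive the compression can become while the data decides how much of that range is actually used.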

Circularity Check

0 steps flagged

No circularity; empirical method validated by direct hardware measurements

The paper introduces a training-free token aggregation algorithm (mutual nearest-neighbor pairing in cosine space, averaging, and gather-based reconstruction) and reports its effects via wall-clock latency and throughput benchmarks on H100 and Raspberry Pi 5. No equations, derivations, or fitted parameters are presented that reduce the claimed speedups to inputs by construction. The central claims rest on external, reproducible measurements rather than self-referential definitions or self-citation chains, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method relies on the domain assumption that averaging mutual nearest neighbors preserves enough information for segmentation, together with the engineering choice of a discrete insertion schedule; no new physical entities or fitted constants are introduced.

free parameters (1)
  • discrete insertion schedule
    The user selects at which layers the merging module is inserted; this discrete choice controls the speed-accuracy operating point.
axioms (1)
  • domain assumption: mutual nearest neighbors in cosine space form pairs whose average retains sufficient semantic information for downstream dense prediction
    Invoked to justify training-free merging without accuracy collapse.
