MPM: Mutual Pair Merging for Efficient Vision Transformers

David Rousseau; Pejman Rasti; Simon Ravé

arXiv: 2604.05718 · v1 · submitted 2026-04-07 · cs.CV

Pith reviewed 2026-05-10 19:18 UTC · model grok-4.3

classification cs.CV

keywords token reduction · vision transformer · semantic segmentation · mutual nearest neighbor · training-free · inference acceleration · ADE20K · latency measurement

The pith

Mutual Pair Merging shortens vision transformer sequences for semantic segmentation by averaging mutual nearest-neighbor token pairs while preserving reconstruction for existing decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that token reduction in vision transformers can deliver real end-to-end latency improvements for semantic segmentation when the reduction method accounts for reconstruction needs and computational overhead. It does so through Mutual Pair Merging (MPM), which pairs tokens that are mutual nearest neighbors in cosine space, averages each pair to shorten the sequence, and keeps a merge map for later gather-based recovery of the full feature map. The approach requires no training and no extra parameters; the compression level is set by choosing where to insert the module. On ADE20K, it yields up to 60 percent lower per-image latency for ViT-Tiny on Raspberry Pi 5 and up to 20 percent higher throughput on H100 with FlashAttention-2, with an mIoU drop below 3 percent. These results suggest that simple pairing strategies can make acceleration practical for dense prediction tasks, where prior methods often fell short on wall-clock metrics.

Core claim

MPM forms mutual nearest-neighbor pairs in cosine space, averages each pair to shorten the token sequence processed by the transformer, and records a merge map that permits gather-based reconstruction of the original-resolution features immediately before the segmentation decoder, allowing any existing head to be used without modification or retraining.
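The three steps in the core claim (mutual pairing, averaging, gather-based reconstruction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `mutual_pair_merge` and `reconstruct`, the single-image `(N, D)` layout, the tie-breaking by first maximum, and the absence of any special-token handling are all assumptions made for illustration.

```python
import numpy as np

def mutual_pair_merge(x):
    """Average mutual nearest-neighbor token pairs (illustrative sketch).

    x: (N, D) array of token features for one image.
    Returns (merged, merge_map): merged is (M, D) with M <= N, and
    merge_map is an (N,) integer array giving each original token's row
    in merged.
    """
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[0]
    # Cosine affinities: normalize rows, then take pairwise dot products.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)      # a token may not pair with itself

    nn = sim.argmax(axis=1)             # nearest neighbor of every token
    idx = np.arange(n)
    mutual = nn[nn] == idx              # i and nn[i] select each other
    # A pair is represented by its smaller index; unpaired tokens
    # represent themselves.
    rep = np.where(mutual, np.minimum(idx, nn), idx)

    # Relabel representatives as consecutive output rows.
    uniq, merge_map = np.unique(rep, return_inverse=True)
    merged = np.zeros((uniq.size, x.shape[1]), dtype=x.dtype)
    np.add.at(merged, merge_map, x)     # sum the members of each group
    counts = np.bincount(merge_map, minlength=uniq.size)
    merged /= counts[:, None]           # group size is 1 or 2
    return merged, merge_map

def reconstruct(merged, merge_map):
    """Gather-based copy-back to the original token count."""
    return merged[merge_map]
```

In this sketch the merge map is a plain integer index array, so reconstruction is a single gather and the output has one row per original token, which is what lets an unmodified decoder consume it.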

What carries the argument

Mutual nearest-neighbor pairing in cosine similarity space that produces pairs where each token is the nearest neighbor of its partner, combined with the recorded merge map enabling gather-based reconstruction.

Load-bearing premise

The time required to identify mutual nearest-neighbor pairs and to perform the subsequent gather reconstruction does not outweigh the computational savings from processing shorter sequences.
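A back-of-the-envelope cost model shows why this premise is plausible but not guaranteed. The sketch below counts multiply-accumulates (MACs) for a standard transformer block and for one dense affinity matrix; the function names, the MLP ratio of 4, and the example token counts are assumptions, not figures from the paper. MAC counts are exactly the kind of proxy the paper warns about, so this is a sanity check only: it ignores memory traffic, the argmax and gather steps, and fused kernels such as FlashAttention-2.

```python
def encoder_macs(n_tokens, dim, mlp_ratio=4):
    """Approximate MACs for one standard transformer encoder block."""
    proj = 4 * n_tokens * dim * dim             # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim        # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # two MLP linear layers
    return proj + attn + mlp

def pairing_macs(n_tokens, dim):
    """MACs for one dense N x N cosine-affinity matrix (N x D times D x N)."""
    return n_tokens * n_tokens * dim

def net_saving(n_before, n_after, dim, layers_after):
    """MACs saved in the remaining blocks minus the cost of one pairing step."""
    per_block = encoder_macs(n_before, dim) - encoder_macs(n_after, dim)
    return layers_after * per_block - pairing_macs(n_before, dim)

# Hypothetical operating point: 1024 tokens reduced to 768, width 192,
# six blocks remaining after the merge.
saving = net_saving(1024, 768, 192, 6)
```

A positive `net_saving` means the arithmetic favors merging; whether wall-clock time follows depends on the hardware, which is why the premise has to be checked by measurement.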

What would settle it

A direct timing experiment on the reported hardware and models in which adding MPM increases rather than decreases total inference latency.
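Such an experiment could be organized around a small harness like the one below. It is a sketch under assumptions: `median_latency_ms` and `relative_change` are hypothetical helper names, and the warmup and repeat counts are arbitrary. On a GPU, the timed callable would also need to wait for queued work to finish (for example with `torch.cuda.synchronize()`) before the clock is read; that step is omitted here to keep the block dependency-free.

```python
import statistics
import time

def median_latency_ms(fn, *args, warmup=3, repeats=10):
    """Median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):          # discard cold-start effects
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

def relative_change(baseline_ms, variant_ms):
    """Signed change relative to the baseline; negative means faster."""
    return (variant_ms - baseline_ms) / baseline_ms
```

Timing the full pipeline with and without the merging module and passing both medians to `relative_change` gives the quantity at stake: a positive value for the MPM variant would contradict the paper's claim on that configuration.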

Figures

Figures reproduced from arXiv: 2604.05718 by David Rousseau, Pejman Rasti, Simon Ravé.

**Figure 1.** Visual abstract of Mutual Pair Merging (MPM). Similar tokens are matched by mutual pairs and averaged together. Tokens without ...

**Figure 2.** Visualization of MPM on the same image during daytime ...

**Figure 3.** Accuracy vs. FPS for MPM using different insertion ...

Original abstract

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MPM shows real end-to-end latency cuts for segmentation ViTs via training-free mutual nearest-neighbor merging, and the reported numbers already fold in the pair-computation cost.

The main point is that this work makes token merging practical for semantic segmentation: it pairs mutual nearest neighbors in cosine space and keeps a merge map so that dense features can be reconstructed before the decoder without touching the head. Everything stays training-free, and the only knob is a discrete schedule for where to merge. They measure end-to-end wall-clock performance on H100 (with and without FlashAttention-2) and on Raspberry Pi 5, not just FLOPs or other proxies, and report mIoU drops below 3% on ADE20K while claiming up to 60% latency reduction for ViT-Tiny on the Pi and up to 20% throughput gain on the H100.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mutual Pair Merging (MPM), a training-free token aggregation module for vision transformers in semantic segmentation. It identifies mutual nearest-neighbor pairs in cosine space, averages each pair, records a merge map, and performs gather-based reconstruction before the decoder so that existing heads can be used unchanged. No learned parameters or continuous compression knobs are introduced; the speed-accuracy trade-off is controlled solely by a discrete insertion schedule. End-to-end latency and throughput are measured on NVIDIA H100 (with and without FlashAttention-2) and Raspberry Pi 5 across standard segmentation datasets, with the central claim that MPM yields up to 60% per-image latency reduction for ViT-Tiny on the Raspberry Pi 5 and up to 20% throughput increase on H100 while keeping the mIoU drop below 3%.

Significance. If the reported net gains hold after full accounting of overhead, the work provides concrete evidence that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock improvements for dense prediction on both accelerators and edge hardware. This addresses a documented limitation in prior token-reduction literature, which often relies on proxy metrics or classification-only settings and rarely reports hardware-measured end-to-end latency for segmentation.

major comments (2)

[Experiments / Latency evaluation] The central latency claims (up to 60% reduction on Raspberry Pi 5 for ViT-Tiny and 20% throughput gain on H100) are load-bearing and rest on the assumption that mutual nearest-neighbor pair computation plus merge-map construction and gather reconstruction impose negligible overhead. No component-wise timing breakdown (merge step versus attention versus decoder) is supplied on either target platform, despite the abstract explicitly noting that such overheads have erased gains in prior work.
[Method / Insertion schedule] The discrete insertion schedule is presented as the sole mechanism for controlling the trade-off, yet the manuscript provides insufficient detail on its concrete implementation (e.g., which layers receive merges, how many pairs are formed per insertion point, and whether the schedule is dataset- or model-specific). This information is required to reproduce the reported mIoU/latency points and to assess the claim that the method is fully parameter-free.

minor comments (2)

[Abstract] The abstract states throughput increases 'by up to 20%' without specifying the exact baseline configuration (e.g., whether FlashAttention-2 is enabled in the baseline).
[Tables and Figures] Figure captions and table footnotes should explicitly state the number of runs or seeds used for the reported latency and mIoU numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical value of MPM for end-to-end latency improvements in semantic segmentation. We address each major comment below and will revise the manuscript to incorporate the requested details and breakdowns.

Referee: [Experiments / Latency evaluation] The central latency claims (up to 60% reduction on Raspberry Pi 5 for ViT-Tiny and 20% throughput gain on H100) are load-bearing and rest on the assumption that mutual nearest-neighbor pair computation plus merge-map construction and gather reconstruction impose negligible overhead. No component-wise timing breakdown (merge step versus attention versus decoder) is supplied on either target platform, despite the abstract explicitly noting that such overheads have erased gains in prior work.

Authors: We agree that a component-wise timing breakdown is necessary to substantiate the net gains and to address the overhead concerns raised in the abstract. In the revised manuscript we will add explicit latency breakdowns (in tables and/or figures) separating the mutual nearest-neighbor search, merge-map construction, token averaging, attention computation, and gather-based reconstruction on both the H100 (with and without FlashAttention-2) and Raspberry Pi 5. These measurements will be obtained from the same experimental setup used for the reported end-to-end figures and will demonstrate that MPM overhead remains small relative to the attention savings.

revision: yes

Referee: [Method / Insertion schedule] The discrete insertion schedule is presented as the sole mechanism for controlling the trade-off, yet the manuscript provides insufficient detail on its concrete implementation (e.g., which layers receive merges, how many pairs are formed per insertion point, and whether the schedule is dataset- or model-specific). This information is required to reproduce the reported mIoU/latency points and to assess the claim that the method is fully parameter-free.

Authors: We acknowledge that additional implementation details are required for full reproducibility. The insertion schedule is model-specific (chosen empirically per architecture such as ViT-Tiny to meet the target accuracy-latency operating point) but contains no learned parameters and is independent of the dataset. In the revision we will expand the method section with (i) the exact layer indices at which merges occur, (ii) the number of pairs merged at each insertion point for the reported configurations, and (iii) a brief description of the empirical selection procedure. This information will also be summarized in a table and accompanied by pseudocode in the supplementary material.

revision: yes
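The promised pseudocode is not available on this page, but one piece of it can be sketched with confidence: when merging is inserted at several layers, the per-insertion merge maps have to be composed so that a single gather restores the original grid. The sketch below assumes each map is an integer array sending every input token to its output row; `compose_merge_maps` is a hypothetical name, not one taken from the paper.

```python
import numpy as np

def compose_merge_maps(first, second):
    """Compose two merge maps into one.

    first:  (N0,) ints mapping original tokens to rows after insertion 1.
    second: (N1,) ints mapping those rows to rows after insertion 2.
    Returns an (N0,) map from original tokens to final rows.
    """
    return second[first]

# Five tokens merged to three, then three merged to two.
first = np.array([0, 0, 1, 1, 2])
second = np.array([0, 1, 1])
composed = compose_merge_maps(first, second)

# One gather with the composed map restores one feature per original token.
final_features = np.array([[1.0], [2.0]])
restored = final_features[composed]
```

Composition is itself a gather over integers, so the bookkeeping cost grows with the original token count rather than with the number of insertions applied at reconstruction time.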

Circularity Check

0 steps flagged

No circularity; empirical method validated by direct hardware measurements

The paper introduces a training-free token aggregation algorithm (mutual nearest-neighbor pairing in cosine space, averaging, and gather-based reconstruction) and reports its effects via wall-clock latency and throughput benchmarks on H100 and Raspberry Pi 5. No equations, derivations, or fitted parameters are presented that reduce the claimed speedups to inputs by construction. The central claims rest on external, reproducible measurements rather than self-referential definitions or self-citation chains, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method relies on the domain assumption that averaging mutual nearest neighbors preserves enough information for segmentation, together with the engineering choice of a discrete insertion schedule; no new physical entities or fitted constants are introduced.

free parameters (1)

discrete insertion schedule
The user selects at which layers the merging module is inserted; this discrete choice controls the speed-accuracy operating point.
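Because each token can belong to at most one mutual pair, a single insertion can remove at most half of the tokens, which bounds what any schedule can achieve. The sketch below computes that bound; `token_count_bounds` is a hypothetical helper, and the actual sequence length is data-dependent and lies somewhere between the two returned values.

```python
def token_count_bounds(n_tokens, num_insertions):
    """Return (lower, upper) bounds on sequence length after the insertions."""
    lower = n_tokens
    for _ in range(num_insertions):
        lower -= lower // 2   # at most floor(n / 2) disjoint pairs per insertion
    return lower, n_tokens    # trivial upper bound: nothing is merged

# Hypothetical schedule with two insertion points on a 1024-token input.
lower, upper = token_count_bounds(1024, 2)
```

The bound makes the discreteness of the knob concrete: adding an insertion point can at best halve the sequence again, and there is no way to request an intermediate keep-rate.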

axioms (1)

domain assumption: Mutual nearest neighbors in cosine space form pairs whose average retains sufficient semantic information for downstream dense prediction.
Invoked to justify training-free merging without accuracy collapse.

pith-pipeline@v0.9.0 · 5542 in / 1560 out tokens · 62998 ms · 2026-05-10T19:18:10.526578+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

MPM computes cosine affinities between tokens, forms pairs using a deterministic mutual nearest-neighbor rule, and merges each accepted pair by simple averaging. A lightweight integer merge map is stored and composed across multiple insertions, and we reconstruct the original H/P × W/P token grid via a gather-based copy-back before the decoder.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

The speed-accuracy trade-off is set by a discrete insertion schedule. ... MPM has no learned parameters and no continuous compression knob.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Token cropr: Faster vits for quite a few tasks

Benjamin Bergner, Christoph Lippert, and Aravindh Ma- hendran. Token cropr: Faster vits for quite a few tasks. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 9740–9750. Computer Vision Foundation / IEEE,

work page 2025

[2] [2]

Token merging: Your vit but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. 1, 2, 3, 5

work page 2023

[3] [3]

Vision transformer adapter for dense predictions

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. 7

work page 2023

[4] [4]

Schwing, Alexan- der Kirillov, and Rohit Girdhar

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask trans- former for universal image segmentation. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 1280–

work page 2022

[5] [5]

1, 2, 7, 8

IEEE, 2022. 1, 2, 7, 8

work page 2022

[6] [6]

Flashattention-2: Faster attention with better paral- lelism and work partitioning

Tri Dao. Flashattention-2: Faster attention with better paral- lelism and work partitioning. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 1, 2, 3, 5, 7

work page 2024

[7]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022,

[8]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 202...

[9]

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 6804–6815. IEEE,

[10]

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: exploring the limits of masked visual representation learning at scale. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 19358–19369. IEEE, 2023. 7, 8

[11]

Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XI, pages 396–414. Sp...

[12]

Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, and Thomas B. Moeslund. Agglomerative token clustering. In Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LVII, pages 200–218. Springer, 2024. 2, 3

[13]

Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024, pages 1372–1381. IEEE, 2024. 2

[14]

Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin, and Yanzhi Wang. Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XI, pages 620–64...

[15]

Dong Hoon Lee and Seunghoon Hong. Learning to merge tokens via decoupled embedding for efficient vision transformers. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. 2

[16]

Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, Weihong Lin, Ding Jia, Zheng Zhang, Chao Zhang, and Han Hu. Expediting large-scale vision transformer for dense prediction without fine-tuning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA,...

[17]

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Evit: Expediting vision transformers via token reorganizations. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. 2

[18]

Xiangcheng Liu, Tianyi Wu, and Guodong Guo. Adaptive sparse vit: Towards learnable adaptive token pruning by fully exploiting self-attention. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pages 1222–1230. ijcai.org, 2023. 2

[19]

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9992–10002. IEEE, 2021. 2, 8

[20]

Chenyang Lu, Daan de Geus, and Gijs Dubbelman. Content-aware token sharing for efficient semantic segmentation with vision transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 23631–23640. IEEE, 2023. 1, 2, 3, 5, 8

[21]

Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel. Token pooling in vision transformers for image classification. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, January 2-7, 2023, pages 12–21. IEEE, 2023. 2

[22]

Narges Norouzi, Svetlana Orlova, Daan de Geus, and Gijs Dubbelman. ALGM: adaptive local-then-global token merging for efficient semantic segmentation with plain vision transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 15773–15782. IEEE, 2024. 1, 2, 3, 8

[23]

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 13937–13949, 2021. 2

[24]

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023,...

[25]

Michael S. Ryoo, A. J. Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. Tokenlearner: Adaptive space-time tokenization for videos. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 12786–12797, 2021. 2

[26]

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. 1, 2, 3, 8

[27]

Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 7242–7252. IEEE, 2021. 1, 2, 3, 6, 7

[28]

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pages 10347–10357. PMLR, 2021. 6, 7

[29]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017. 1

[30]

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 548–558. IEEE, 2021. 2

[31]

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media, 8(3):415–424, 2022. 2

[32]

Zhehao Wang, Xian Lin, Nannan Wu, Li Yu, Kwang-Ting Cheng, and Zengqiang Yan. Dtmformer: Dynamic token merging for boosting transformer-based medical image segmentation. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, February 20-27, 2024, Vancouver, ...

[33]

Xin-Jian Wu, Fanhu Zeng, Xiudong Wang, Yunhe Wang, and Xinghao Chen. PPT: token pruning and pooling for efficient vision transformers. CoRR, abs/2310.01812, 2023. 2

https://arxiv.org/abs/2310.01812

[34]

Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part V, pages 432–448. Springer, 2018. 2

[35]

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, José M. Álvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 12077–12090...

[36]

Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-vit: Slow-fast token evolution for dynamic vision transformer. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelvet...

[37]

Hongxu Yin, Arash Vahdat, José M. Álvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-vit: Adaptive tokens for efficient vision transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10799–10808. IEEE, 2022. 2

[38]

Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, and Yifan Liu. Segvit: Semantic segmentation with plain vision transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. 1

[39]

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5122–5130. IEEE Computer Society, 2017. 7