MPM: Mutual Pair Merging for Efficient Vision Transformers
Pith reviewed 2026-05-10 19:18 UTC · model grok-4.3
The pith
Mutual Pair Merging shortens vision transformer sequences for semantic segmentation by averaging mutual nearest-neighbor token pairs while preserving reconstruction for existing decoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MPM forms mutual nearest-neighbor pairs in cosine space and averages each pair, which shortens the token sequence processed by the transformer. It also records a merge map that permits gather-based reconstruction of the original-resolution features immediately before the segmentation decoder, so any existing head can be used without modification or retraining.
What carries the argument
Mutual nearest-neighbor pairing in cosine similarity space that produces pairs where each token is the nearest neighbor of its partner, combined with the recorded merge map enabling gather-based reconstruction.
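The pairing rule described here can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `cosine`, `mutual_pair_merge`, and `reconstruct`, and the list-based layout of the merge map, are hypothetical, and a practical version would presumably use batched tensor operations rather than Python loops.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors with nonzero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mutual_pair_merge(tokens):
    """Average every mutual nearest-neighbor pair (assumes at least two tokens).

    Returns (merged, merge_map), where merge_map[i] is the index in `merged`
    that original token i was written to.
    """
    n = len(tokens)
    # Nearest neighbor of each token by cosine similarity, excluding itself.
    nearest = [
        max((j for j in range(n) if j != i),
            key=lambda j, i=i: cosine(tokens[i], tokens[j]))
        for i in range(n)
    ]
    merged = []
    merge_map = [-1] * n
    for i in range(n):
        if merge_map[i] != -1:
            continue  # already absorbed as the partner of an earlier token
        j = nearest[i]
        if nearest[j] == i:
            # Mutual pair: i and j each pick the other, so average them.
            merged.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[j])])
            merge_map[i] = merge_map[j] = len(merged) - 1
        else:
            # No mutual partner: the token passes through unchanged.
            merged.append(list(tokens[i]))
            merge_map[i] = len(merged) - 1
    return merged, merge_map


def reconstruct(merged, merge_map):
    """Gather-based copy-back to the original sequence length."""
    return [merged[k] for k in merge_map]


tokens = [[1.0, 0.0], [1.0, 0.25], [0.0, 1.0], [0.25, 1.0], [-1.0, -1.0]]
merged, merge_map = mutual_pair_merge(tokens)
restored = reconstruct(merged, merge_map)
```

In this toy input, tokens 0 and 1 pick each other, as do tokens 2 and 3, while token 4 has no mutual partner and is kept as-is. The sequence shrinks from five tokens to three, and the gather restores five positions for the decoder.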
Load-bearing premise
The time required to identify mutual nearest-neighbor pairs and to perform the subsequent gather reconstruction does not outweigh the computational savings from processing shorter sequences.
What would settle it
A direct timing experiment on the reported hardware and models in which adding MPM increases rather than decreases total inference latency.
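A minimal sketch of how such a timing comparison could be set up is shown below. The helper names `median_latency_ms` and `relative_change`, the warm-up and run counts, and the stand-in workloads are all assumptions made for illustration; they do not come from the paper.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Median wall-clock time of fn() in milliseconds after a few warm-up calls.

    On a GPU, fn would need to block until the device has finished
    (for example by calling torch.cuda.synchronize() in PyTorch) before
    the clock is read; that detail is omitted from this CPU-only sketch.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def relative_change(baseline_ms, variant_ms):
    """Signed relative latency change; negative means the variant is faster."""
    return (variant_ms - baseline_ms) / baseline_ms


# Stand-in workloads; a real experiment would time the full model
# (backbone, merge and reconstruction steps, decoder) end to end.
baseline = median_latency_ms(lambda: sum(range(20000)))
variant = median_latency_ms(lambda: sum(range(10000)))
change = relative_change(baseline, variant)
```

A positive `change` for the MPM-enabled model on the reported hardware would be the outcome that contradicts the latency claim.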
Original abstract
Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mutual Pair Merging (MPM), a training-free token aggregation module for vision transformers in semantic segmentation. It identifies mutual nearest-neighbor pairs in cosine space, averages each pair, records a merge map, and performs gather-based reconstruction before the decoder so that existing heads can be used unchanged. No learned parameters or continuous compression knobs are introduced; the speed-accuracy trade-off is controlled solely by a discrete insertion schedule. End-to-end latency and throughput are measured on NVIDIA H100 (with and without FlashAttention-2) and Raspberry Pi 5 across standard segmentation datasets, with the central claim that MPM yields up to 60% per-image latency reduction for ViT-Tiny on the Raspberry Pi 5 and up to 20% throughput increase on H100 with FlashAttention-2 while keeping the mIoU drop below 3%.
Significance. If the reported net gains hold after full accounting of overhead, the work provides concrete evidence that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock improvements for dense prediction on both accelerators and edge hardware. This addresses a documented limitation in prior token-reduction literature, which often relies on proxy metrics or classification-only settings and rarely reports hardware-measured end-to-end latency for segmentation.
major comments (2)
- [Experiments / Latency evaluation] The central latency claims (up to 60% reduction on Raspberry Pi 5 for ViT-Tiny and 20% throughput gain on H100) are load-bearing and rest on the assumption that mutual nearest-neighbor pair computation plus merge-map construction and gather reconstruction impose negligible overhead. No component-wise timing breakdown (merge step versus attention versus decoder) is supplied on either target platform, despite the abstract explicitly noting that such overheads have erased gains in prior work.
- [Method / Insertion schedule] The discrete insertion schedule is presented as the sole mechanism for controlling the trade-off, yet the manuscript provides insufficient detail on its concrete implementation (e.g., which layers receive merges, how many pairs are formed per insertion point, and whether the schedule is dataset- or model-specific). This information is required to reproduce the reported mIoU/latency points and to assess the claim that the method is fully parameter-free.
minor comments (2)
- [Abstract] The abstract states throughput increases 'by up to 20%' without specifying the exact baseline configuration (e.g., whether FlashAttention-2 is enabled in the baseline).
- [Tables and Figures] Figure captions and table footnotes should explicitly state the number of runs or seeds used for the reported latency and mIoU numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the practical value of MPM for end-to-end latency improvements in semantic segmentation. We address each major comment below and will revise the manuscript to incorporate the requested details and breakdowns.
- Referee: [Experiments / Latency evaluation] The central latency claims (up to 60% reduction on Raspberry Pi 5 for ViT-Tiny and 20% throughput gain on H100) are load-bearing and rest on the assumption that mutual nearest-neighbor pair computation plus merge-map construction and gather reconstruction impose negligible overhead. No component-wise timing breakdown (merge step versus attention versus decoder) is supplied on either target platform, despite the abstract explicitly noting that such overheads have erased gains in prior work.
  Authors: We agree that a component-wise timing breakdown is necessary to substantiate the net gains and to address the overhead concerns raised in the abstract. In the revised manuscript we will add explicit latency breakdowns (in tables and/or figures) separating the mutual nearest-neighbor search, merge-map construction, token averaging, attention computation, and gather-based reconstruction on both the H100 (with and without FlashAttention-2) and Raspberry Pi 5. These measurements will be obtained from the same experimental setup used for the reported end-to-end figures and will demonstrate that MPM overhead remains small relative to the attention savings.
  Revision: yes
- Referee: [Method / Insertion schedule] The discrete insertion schedule is presented as the sole mechanism for controlling the trade-off, yet the manuscript provides insufficient detail on its concrete implementation (e.g., which layers receive merges, how many pairs are formed per insertion point, and whether the schedule is dataset- or model-specific). This information is required to reproduce the reported mIoU/latency points and to assess the claim that the method is fully parameter-free.
  Authors: We acknowledge that additional implementation details are required for full reproducibility. The insertion schedule is model-specific (chosen empirically per architecture such as ViT-Tiny to meet the target accuracy-latency operating point) but contains no learned parameters and is independent of the dataset. In the revision we will expand the method section with (i) the exact layer indices at which merges occur, (ii) the number of pairs merged at each insertion point for the reported configurations, and (iii) a brief description of the empirical selection procedure. This information will also be summarized in a table and accompanied by pseudocode in the supplementary material.
  Revision: yes
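To make concrete what a reported schedule would need to pin down, the sketch below tracks sequence length through a backbone given layer indices and per-insertion pair counts. Every number here is hypothetical (12 layers, 1024 input tokens, merges before layers 3 and 6), as is the function name `tokens_per_layer`; none of it is taken from the manuscript.

```python
def tokens_per_layer(num_layers, n_tokens, pairs_merged):
    """Number of tokens each layer processes.

    pairs_merged maps a layer index to the number of pairs merged
    immediately before that layer. Each merged pair removes one token,
    and at most floor(n / 2) disjoint pairs can exist among n tokens.
    """
    counts = []
    n = n_tokens
    for layer in range(num_layers):
        k = pairs_merged.get(layer, 0)
        if k > n // 2:
            raise ValueError(f"layer {layer}: cannot merge {k} pairs from {n} tokens")
        n -= k
        counts.append(n)
    return counts


# Hypothetical schedule: merges inserted before layers 3 and 6.
counts = tokens_per_layer(12, 1024, {3: 300, 6: 200})
```

Because mutual pairs are data dependent, the pair counts in a real run would vary per image, which is one reason the referee's request for per-insertion numbers matters for reproducibility.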
Circularity Check
No circularity; empirical method validated by direct hardware measurements
The paper introduces a training-free token aggregation algorithm (mutual nearest-neighbor pairing in cosine space, averaging, and gather-based reconstruction) and reports its effects via wall-clock latency and throughput benchmarks on H100 and Raspberry Pi 5. No equations, derivations, or fitted parameters are presented that reduce the claimed speedups to inputs by construction. The central claims rest on external, reproducible measurements rather than self-referential definitions or self-citation chains, rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- discrete insertion schedule
axioms (1)
- domain assumption: Mutual nearest neighbors in cosine space form pairs whose average retains sufficient semantic information for downstream dense prediction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: MPM computes cosine affinities between tokens, forms pairs using a deterministic mutual nearest-neighbor rule, and merges each accepted pair by simple averaging. A lightweight integer merge map is stored and composed across multiple insertions, and we reconstruct the original H/P×W/P token grid via a gather-based copy-back before the decoder.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: The speed-accuracy trade-off is set by a discrete insertion schedule. ... MPM has no learned parameters and no continuous compression knob.
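The quoted passage mentions that the integer merge map is composed across insertions. One plausible reading, sketched below with hypothetical function names `compose` and `gather`, is that each insertion yields a map from its input positions to its output positions, and the maps are chained so that a single gather restores the original grid.

```python
def compose(first, second):
    """Chain two merge maps.

    first[i]  : index after the first merge for original token i
    second[k] : index after the second merge for first-stage token k
    The result maps each original token directly to its final index.
    """
    return [second[k] for k in first]


def gather(tokens, merge_map):
    """Copy each final token back to every original position it covers."""
    return [tokens[k] for k in merge_map]


first = [0, 0, 1, 1, 2]   # 5 tokens reduced to 3
second = [0, 1, 0]        # 3 tokens reduced to 2
composed = compose(first, second)
restored = gather(["a", "b"], composed)
```

Here `composed` sends original tokens 0, 1, and 4 to final token 0 and tokens 2 and 3 to final token 1, so one gather over the two surviving tokens fills all five original positions.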
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.