Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Karim Haroun; Martyna Poreba; Michal Szczepanski

arxiv: 2509.14165 · v1 · pith:3QK3GFHXnew · submitted 2025-09-17 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-21 22:22 UTC · model grok-4.3


keywords: Vision Transformers · Semantic Segmentation · Token Pruning · Dynamic Patch Merging · Early Exits · Efficient Inference · High-Resolution Images


The pith

STEP merges patches into superpatches via a CNN policy and prunes high-confidence tokens early to reduce Vision Transformer costs in high-resolution segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STEP as a way to make Vision Transformers practical for detailed semantic segmentation on large images by dynamically grouping standard patches into larger superpatches and stopping computation on tokens whose class is already clear. A small CNN decides the merges, while early exits in the encoder layers halt up to 40 percent of tokens before they reach the final layer. The reported result is a 2.5 times token reduction from merging alone and up to 4 times lower computational complexity for the full framework, with a 1.7 times gain in inference speed, while accuracy stays within 2 percent of the baseline model on high-resolution benchmarks.

Core claim

STEP integrates dCTS, a lightweight CNN-based policy network that enables flexible merging of patches into superpatches, together with early-exit blocks that remove high-confidence supertokens from further encoder processing, yielding up to 2.5 times fewer tokens from dCTS alone and 4 times lower computational complexity overall with at most a 2 percent accuracy drop on images up to 1024 by 1024.
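
To put the token figures in concrete terms, the sketch below counts tokens for a 1024 by 1024 image under fixed 16 by 16 patching and applies the reported 2.5 times reduction. The helper names `patch_tokens` and `reduced_tokens` are illustrative assumptions, not identifiers from the paper, and the count ignores any class token.

```python
def patch_tokens(height: int, width: int, patch: int = 16) -> int:
    """Count non-overlapping patch tokens for an image (class token excluded)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def reduced_tokens(tokens: int, factor: float) -> int:
    """Token count after dividing by a reduction factor, rounded to an integer."""
    return round(tokens / factor)


# Fixed 16x16 patching on a 1024x1024 image gives a 64 x 64 grid of tokens.
baseline = patch_tokens(1024, 1024)

# Applying the 2.5x reduction reported for dCTS alone.
merged = reduced_tokens(baseline, 2.5)

print(baseline, merged)
```

Under these assumptions the baseline is 4096 tokens and the merged count is about 1638, which is the scale at which the reported compute and throughput gains apply.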

What carries the argument

dCTS, the lightweight CNN-based policy network that decides flexible merging into superpatches, paired with early-exits on high-confidence supertokens inside the encoder blocks.
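
A minimal sketch of how a merge policy translates into token savings, assuming the simplest case where each 2 by 2 group of patches is either kept as four tokens or merged into one supertoken. The per-group scores, the 0.5 threshold, and the function name are hypothetical; the paper's dCTS may support other superpatch shapes and a different decision rule.

```python
def tokens_after_merge(group_scores, threshold=0.5):
    """Count tokens when each 2x2 patch group is merged if its score is high.

    group_scores: 2D list with one hypothetical policy score in [0, 1]
    per 2x2 group of patches. A group with score >= threshold becomes a
    single supertoken; any other group keeps its four original tokens.
    """
    total = 0
    for row in group_scores:
        for score in row:
            total += 1 if score >= threshold else 4
    return total


# Illustrative scores for a 4x4 patch grid, i.e. a 2x2 grid of groups.
scores = [
    [0.9, 0.2],
    [0.7, 0.8],
]

original = 4 * sum(len(row) for row in scores)  # 16 patches before merging
remaining = tokens_after_merge(scores)          # three merged groups, one kept
print(original, remaining)
```

Here three of the four groups merge, so 16 patches become 7 tokens. The savings depend entirely on how many regions the policy judges safe to merge.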

If this is right

dCTS alone cuts token count by a factor of 2.5, computational cost by 2.6 times, and raises throughput 3.4 times on a ViT-Large backbone.
The complete STEP framework reaches 4 times lower computational complexity and 1.7 times faster inference speed.
Up to 40 percent of tokens can be halted before the final encoder layer under the reported configurations.
The approach is tested on high-resolution semantic segmentation benchmarks with images as large as 1024 by 1024.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same merging-plus-early-exit pattern could be applied to other dense prediction tasks such as depth estimation or instance segmentation to achieve similar compute savings.
The locations where tokens exit early may highlight image regions that are easy to classify, suggesting a route to adaptive resolution or region-specific refinement.
Integrating STEP with model compression methods like quantization could compound the efficiency gains without additional accuracy cost.

Load-bearing premise

The small CNN policy network chooses merges that preserve necessary detail, and early removal of high-confidence tokens does not discard information that later layers would need for correct segmentation.

What would settle it

Run the full STEP pipeline on a 1024 by 1024 semantic segmentation test set and measure both final mIoU and effective FLOPs; if accuracy falls more than 2 percent or token count does not drop by at least 2 times relative to standard 16 by 16 patching, the efficiency claims are not supported.
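
The test above can be written as a small check. This is a sketch under stated assumptions: the accuracy drop is read as absolute mIoU percentage points, the function name and defaults are hypothetical, and the numbers in the example calls are made up for illustration, not results from the paper.

```python
def claims_supported(baseline_miou, step_miou, baseline_tokens, step_tokens,
                     max_drop=2.0, min_token_factor=2.0):
    """Return True if both efficiency criteria hold.

    mIoU values are percentages; the drop is measured in absolute points.
    The token factor is baseline tokens divided by tokens after reduction.
    """
    drop = baseline_miou - step_miou
    factor = baseline_tokens / step_tokens
    return drop <= max_drop and factor >= min_token_factor


# Illustrative inputs only (not measured values).
print(claims_supported(80.0, 78.5, 4096, 1638))  # drop 1.5 points, factor 2.5
print(claims_supported(80.0, 77.5, 4096, 1638))  # drop 2.5 points
```

If the paper intends a relative rather than absolute accuracy drop, the `drop` line would need to divide by `baseline_miou` instead.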

Original abstract

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks integrate also early-exits to remove high-confident supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

STEP combines CNN-driven dynamic merging with early exits to cut tokens and compute in high-res ViT segmentation, but the early-pruning safety rests on an assumption that later layers add little value.


The main thing to know is that this paper gives a practical recipe for making ViT-Large run faster on 1024x1024 semantic segmentation. Their STEP method uses a lightweight CNN called dCTS to merge standard patches into content-dependent superpatches, then adds early exits inside encoder blocks so high-confidence supertokens can stop before the final layers. With dCTS alone they report a 2.5 times drop in token count versus fixed 16x16 patching, which yields 2.6 times lower compute and 3.4 times higher throughput. The full combination reaches a 4 times reduction in computational complexity and a 1.7 times gain in inference speed, with at most a 2 percent accuracy loss and up to 40 percent of tokens exiting early.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes STEP, a hybrid token-reduction framework for Vision Transformers in high-resolution semantic segmentation. It introduces dCTS, a lightweight CNN-based policy network for dynamic merging of patches into superpatches, combined with early-exits in encoder blocks to prune high-confidence supertokens. On benchmarks with images up to 1024×1024, it claims that dCTS alone reduces token count by 2.5× (yielding 2.6× compute reduction and 3.4× throughput on ViT-Large), while the full STEP framework achieves up to 4× complexity reduction, 1.7× inference speedup, ≤2% accuracy drop, and early halting of up to 40% of tokens.

Significance. If the results hold and the early-pruning decisions prove safe, the work could provide a practical route to lowering the computational burden of ViT-based high-resolution segmentation. The hybrid design targets both spatial redundancy via flexible merging and per-layer computation via confidence-based exits, which is relevant for resource-limited settings. The reported throughput and complexity gains are substantial enough to matter for real-world deployment if the accuracy bound generalizes.

major comments (1)

The headline claims of 4× complexity reduction and ≤2.0% accuracy drop with up to 40% early halting rest on the assumption that intermediate-layer confidence scores are a sufficient proxy for final-task utility, such that pruned supertokens require no later-layer refinement. This assumption is load-bearing for high-resolution inputs where fine boundary details and long-range context are spatially sparse and often resolved only in deeper self-attention layers. No ablation is described that compares confidence-based early-exits against random or uniform token removal at the same depth and token count to isolate whether the selection avoids disproportionate information loss on harder regions or classes.

minor comments (2)

The abstract reports concrete quantitative results (2.5× token reduction, 2.6× cost reduction, etc.) but does not specify the exact datasets, image resolutions tested, number of runs, or error bars, which would help readers assess the robustness of the efficiency numbers.
The introduction or method section would benefit from an explicit definition and diagram of how dCTS produces flexible superpatches versus standard fixed patching, to clarify the difference from prior token-merging techniques.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will revise the paper accordingly to strengthen the validation of our early-exit strategy.


Referee: The headline claims of 4× complexity reduction and ≤2.0% accuracy drop with up to 40% early halting rest on the assumption that intermediate-layer confidence scores are a sufficient proxy for final-task utility, such that pruned supertokens require no later-layer refinement. This assumption is load-bearing for high-resolution inputs where fine boundary details and long-range context are spatially sparse and often resolved only in deeper self-attention layers. No ablation is described that compares confidence-based early-exits against random or uniform token removal at the same depth and token count to isolate whether the selection avoids disproportionate information loss on harder regions or classes.

Authors: We agree that a direct ablation against random or uniform token removal would better isolate the benefit of confidence-based selection and address potential concerns about information loss on boundaries or sparse regions in high-resolution inputs. The current manuscript reports overall accuracy and pruning statistics but does not include this specific controlled comparison at matched depths and token counts. In the revised version we will add such an ablation on the high-resolution benchmarks, removing equivalent numbers of tokens randomly or uniformly at the same encoder stages and measuring accuracy drops (overall and per-class/boundary). We expect this to show larger degradation under random/uniform policies, supporting that our confidence proxy is effective.

Revision: yes
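
The matched-count ablation discussed here needs three selection policies that remove the same number of tokens at a given stage. The sketch below shows one way to produce those index sets. All function names, the seed, and the confidence values are illustrative assumptions, and measuring the resulting accuracy is left out.

```python
import random


def select_by_confidence(confidences, k):
    """Indices of the k most confident tokens (those an early exit would halt)."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    return sorted(order[:k])


def select_random(n, k, seed=0):
    """k distinct indices drawn at random from range(n), with a fixed seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), k))


def select_uniform(n, k):
    """k indices spread evenly over range(n); requires k <= n."""
    return [(i * n) // k for i in range(k)]


# Made-up per-token confidences at one encoder stage.
confidences = [0.99, 0.40, 0.97, 0.55, 0.91, 0.30, 0.98, 0.60, 0.45, 0.96]
k = 4  # remove the same number of tokens under every policy

by_conf = select_by_confidence(confidences, k)
by_rand = select_random(len(confidences), k)
by_unif = select_uniform(len(confidences), k)
print(by_conf, by_rand, by_unif)
```

Holding `k` and the encoder stage fixed across the three policies is what lets any accuracy gap be attributed to the selection rule rather than to the amount of pruning.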

Circularity Check

0 steps flagged

No circularity: empirical efficiency claims rest on measured outcomes against baselines


The paper presents STEP as an empirical engineering framework combining a lightweight CNN policy network (dCTS) for dynamic superpatch merging with early-exit pruning in ViT encoders for high-resolution semantic segmentation. Reported gains (2.5x token reduction, 4x complexity drop, ≤2% accuracy loss, up to 40% early halts) are stated as experimental results on benchmarks up to 1024×1024 images versus the standard 16×16 patching baseline. No equations, fitted parameters, or self-citations are shown that would reduce these quantities to definitions or inputs internal to the paper itself. The derivation chain is therefore self-contained as a set of measured performance deltas rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claims rest on the effectiveness of a newly introduced CNN policy network and early-exit rules whose parameters are learned from data; the paper also assumes the standard 16x16 patching baseline is the relevant comparison point. No independent evidence for the new components is supplied beyond the reported numbers.

free parameters (2)

dCTS policy network weights
Learned parameters of the lightweight CNN that decides patch merging.
early-exit confidence threshold
Threshold used to decide when a supertoken's prediction is confident enough to halt.
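
A minimal sketch of a confidence-threshold exit rule, assuming confidence is the maximum softmax probability from an auxiliary classifier at an intermediate layer. The 0.95 threshold, the logits, and the function names are illustrative assumptions; the paper's exact confidence measure and threshold values are not given in the text above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_by_confidence(token_logits, threshold=0.95):
    """Split tokens into halted (index, predicted class) pairs and active indices."""
    halted, active = [], []
    for idx, logits in enumerate(token_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            halted.append((idx, probs.index(confidence)))
        else:
            active.append(idx)
    return halted, active


# Made-up logits for three supertokens over three classes.
token_logits = [
    [8.0, 0.0, 0.0],  # clearly class 0
    [1.0, 0.9, 0.8],  # ambiguous, keeps going through the encoder
    [0.0, 0.0, 6.0],  # clearly class 2
]
halted, active = split_by_confidence(token_logits)
print(halted, active)
```

Raising the threshold halts fewer tokens and lowers the risk of freezing a wrong prediction; lowering it trades accuracy for compute, which is why this value counts as a free parameter.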

axioms (1)

domain assumption · Standard 16x16 pixel patching is the appropriate baseline for token count and cost comparisons.
Explicitly referenced in the abstract as the scheme against which reductions are measured.

invented entities (2)

dCTS (dynamic CNN-based token selector) · no independent evidence
purpose: Lightweight network to enable flexible merging into superpatches.
New component introduced by the paper.
supertokens / superpatches · no independent evidence
purpose: Merged tokens that reduce overall token count.
Core representational change proposed in the framework.

pith-pipeline@v0.9.0 · 5776 in / 1646 out tokens · 71211 ms · 2026-05-21T22:22:17.535800+00:00 · methodology



work page doi:10.1007/s11263-015-0816-y 2015
[63]

https://github.com/open-mmlab/ mmsegmentation (2020)

MMSegmentation Contributors: MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. https://github.com/open-mmlab/ mmsegmentation (2020)

work page 2020
[64]

In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp

Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1209–1218. IEEE Computer Society, Los Alamitos, CA, USA (2018). https://doi.org/10.1109/CVPR.2018.00132 . https://doi.ieeecomputersociety.org/10.1109/CVPR.2018.00132

work page doi:10.1109/cvpr.2018.00132 2018
[65]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

work page 2017
[66]

In: Proc

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 30

work page 2016

[1] Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)

[2] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890 (2021)

[3] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022 (2021)

[4] Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7262–7272 (2021)

[5] Zhang, W., Pang, J., Chen, K., Loy, C.C.: K-net: Towards unified image segmentation. Advances in Neural Information Processing Systems 34, 10326–10338 (2021)

[6] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems 34, 12077–12090 (2021)

[7] Zhang, B., Tian, Z., Tang, Q., Chu, X., Wei, X., Shen, C., Liu, Y.: Segvit: Semantic segmentation with plain vision transformers. NeurIPS (2022)

[8] Kerssies, T., Cavagnero, N., Hermans, A., Norouzi, N., Averta, G., Leibe, B., Dubbelman, G., Geus, D.: Your vit is secretly an image segmentation model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

[9] Yoo, J., Ko, D., Kim, G.: Ccaseg: Decoding multi-scale context with convolutional cross-attention for semantic segmentation. In: Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp. 9461–9470 (2025)

[10] Yeom, S., Klitzing, J.: U-mixformer: Unet-like transformer with mix-attention for efficient semantic segmentation. In: Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp. 7710–7719 (2025)

[11] Hu, X., Jiang, L., Schiele, B.: Training vision transformers for semi-supervised semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4007–4017 (2024)

[12] Lin, Y., Zhang, T., Sun, P., Li, Z., Zhou, S.: Fq-vit: Post-training quantization for fully quantized vision transformer. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1173–1179 (2022)

[13] Yuan, Z., Xue, C., Chen, Y., Wu, Q., Sun, G.: Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII, pp. 191–207. Springer, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-031-19775-8_12

[14] Li, Z., Gu, Q.: I-vit: Integer-only quantization for efficient vision transformer inference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17065–17075 (2023)

[15] Huang, X., Shen, Z., Dong, P., Cheng, K.-T.: Quantization variation: A new perspective on training transformers with low-bit precision. Transactions on Machine Learning Research (2024)

[16] Shang, Y., Liu, G., Kompella, R., Yan, Y.: Quantized-vit efficient training via fisher matrix regularization. In: MultiMedia Modeling: 31st International Conference on Multimedia Modeling, MMM 2025, Nara, Japan, January 8–10, 2025, Proceedings, Part III, pp. 270–284. Springer, Berlin, Heidelberg (2025). https://doi.org/10.1007/978-981-96-2064-7_20 . ...

[17] Wu, K., Zhang, J., Peng, H., Liu, M., Xiao, B., Fu, J., Yuan, L.: Tinyvit: Fast pretraining distillation for small vision transformers. In: European Conference on Computer Vision (ECCV) (2022)

[18] Yang, Z., Li, Z., Zeng, A., Li, Z., Yuan, C., Li, Y.: ViTKD: Feature-based Knowledge Distillation for Vision Transformers. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1379–1388. IEEE Computer Society, Los Alamitos, CA, USA (2024). https://doi.org/10.1109/CVPRW63382.2024.00145 . https://doi.ieeecompu...

[19] Proust, M., Poreba, M., Szczepanski, M., Haroun, K.: Step: Supertoken and early-pruning for efficient semantic segmentation. In: VISIGRAPP 2025-20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, pp. 56–61 (2025). https://doi.org/10.5220/0013132800003912 . https://www.scitepress.org/Papers/2025...

[20] Lu, C., de Geus, D., Dubbelman, G.: Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

[21] Havtorn, J.D., Royer, A., Blankevoort, T., Bejnordi, B.E.: MSViT: Dynamic mixed-scale tokenization for vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 838–848 (2023)

[22] Chen, M., Lin, M., Li, K., Shen, Y., Wu, Y., Chao, F., Ji, R.: Cf-vit: A general coarse-to-fine method for vision transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37

[23] Ronen, T., Levy, O., Golbert, A.: Vision transformers with mixed-resolution tokenization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4613–4622

[24] Mahmud, T., Yaman, B., Liu, C.-H., Marculescu, D.: PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference (2024). https://arxiv.org/abs/2403.16020

[25] Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.-J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)

[26] Fayyaz, M., Abbasi Kouhpayegani, S., Rezaei Jafari, F., Sommerlade, E., Vaezi Joze, H.R., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vision transformers. European Conference on Computer Vision (ECCV) (2022)

[27] Kim, S., Shen, S., Thorsley, D., Gholami, A., Kwon, W., Hassoun, J., Keutzer, K.: Learned token pruning for transformers. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’22, pp. 784–794. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3534678.3539260 . https://doi.or...

[28] Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al.: Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI, pp. 620–640 (2022). Springer

[29] Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Not all patches are what you need: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=BjyvwnXXVn_

[30] Meng, L., Li, H., Chen, B.-C., Lan, S., Wu, Z., Jiang, Y.-G., Lim, S.-N.: AdaViT: Adaptive Vision Transformers for Efficient Image Recognition. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12299–12308. IEEE Computer Society, Los Alamitos, CA, USA (2022). https://doi.org/10.1109/CVPR52688.2022.01199 . https://doi...

[31] Song, Z., Xu, Y., He, Z., Jiang, L., Jing, N., Liang, X.: CP-ViT: Cascade Vision Transformer Pruning Via Progressive Sparsity Prediction. https://doi.org/10.48550/arXiv.2203.04570

[32] Marin, D., Chang, J.-H.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers for image classification. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 12–21 (2023). https://doi.org/10.1109/WACV56688.2023.00010

[33] Bolya, D., Fu, C.-Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Representations (2023)

[34] Tang, Q., Zhang, B., Liu, J., Liu, F., Liu, Y.: Dynamic token pruning in plain vision transformers for semantic segmentation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 777–786. IEEE Computer Society, Los Alamitos, CA, USA (2023). https://doi.org/10.1109/ICCV51070.2023.00078 . https://doi.ieeecomputersociety.org/10.1109/ICCV...

[36] Liu, X., Wu, T., Guo, G.: Adaptive sparse vit: Towards learnable adaptive token pruning by fully exploiting self-attention, pp. 1222–1230 (2023). https://doi.org/10.24963/ijcai.2023/136

[37] Marchetti, M., Traini, D., Ursino, D., Virgili, L.: Efficient token pruning in vision transformers using an attention-based multilayer network. Expert Systems with Applications 279, 127449 (2025) https://doi.org/10.1016/j.eswa.2025.127449

[38] Wang, H., Dedhia, B., Jha, N.K.: Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers, pp. 16070–16079 (2024). https://doi.org/10.1109/CVPR52733.2024.01521

[39] Xu, Y., Zhang, Z., Zhang, M., Sheng, K., Li, K., Dong, W., Zhang, L., Xu, C., Sun, X.: Evo-vit: Slow-fast token evolution for dynamic vision transformer. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 2964–2972 (2022)

[40] Courdier, E., Sivaprasad, P.T., Fleuret, F.: PAUMER: Patch Pausing Transformer for Semantic Segmentation (2023). https://arxiv.org/abs/2311.00586

[41] Liu, Y., Zhou, Q., Wang, J., Wang, Z., Wang, F., Wang, J., Zhang, W.: Dynamic token-pass transformers for semantic segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1816–1825 (2024). https://doi.org/10.1109/WACV57701.2024.00184

[42] Liu, Y., Gehrig, M., Messikommer, N., Cannici, M., Scaramuzza, D.: Revisiting token pruning for object detection and instance segmentation. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2646–2656. IEEE Computer Society, Los Alamitos, CA, USA (2024). https://doi.org/10.1109/WACV57701.2024.00264 . https://doi.ieee...

[43] Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-ViT: Adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10809–10818 (2022)

[45] Zeng, W., Jin, S., Xu, L., Liu, W., Qian, C., Ouyang, W., Luo, P., Wang, X.: Tcformer: Visual recognition via token clustering transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

[46] Li, J., Wang, Y., Zhang, X., Shi, B., Jiang, D., Li, C., Dai, W., Xiong, H., Tian, Q.: Ailurus: A scalable vit framework for dense prediction. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 30979–30996. Curran Associates, Inc. (2023). https://proceed...

[47] Marin, D., Chang, J.-H.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers for image classification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 12–21 (2023)

[48] Norouzi, N., Orlova, S., De Geus, D., Dubbelman, G.: Algm: Adaptive local-then-global token merging for efficient semantic segmentation with plain vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15773–15782 (2024)

[49] Haroun, K., Martinet, J., Chehida, K.B., Allenet, T.: Leveraging local similarity for token merging in vision transformers. In: ICONIP 2024-31st International Conference on Neural Information Processing (2024)

[50] Haroun, K., Allenet, T., Chehida, K.B., Martinet, J.: Dynamic hierarchical token merging for vision transformers. In: VISAPP-2025-20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2025)

[51] Lee, D.H., Hong, S.: Learning to merge tokens via decoupled embedding for efficient vision transformers. In: Conference on Neural Information Processing Systems (2024)

[52] Bonnaerens, M., Dambre, J.: Learned thresholds token merging and pruning for vision transformers. Transactions on Machine Learning Research (2023)

[53] Kim, M., Gao, S., Hsu, Y.-C., Shen, Y., Jin, H.: Token fusion: Bridging the gap between token pruning and token merging. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1383–1392 (2024)

[54] Wu, X., Zeng, F., Wang, X., Chen, X.: PPT: Token Pruning and Pooling for Efficient Vision Transformers (2024). https://arxiv.org/abs/2310.01812

[55] Chen, M., Shao, W., Xu, P., Lin, M., Zhang, K., Chao, F., Ji, R., Qiao, Y., Luo, P.: Diffrate: Differentiable compression rate for efficient vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17164–17174 (2023)

[56] Chen, D., Lin, K., Deng, Q.: Ucc: A unified cascade compression framework for vision transformer models. Neurocomputing 612, 128747 (2025) https://doi.org/10.1016/j.neucom.2024.128747

[57] Mao, J., Shen, Y., Guo, J., Yao, Y., Hua, X., Shen, H.: Prune and merge: Efficient token compression for vision transformer with spatial information preserved. IEEE Transactions on Multimedia PP, 1–14 (2025) https://doi.org/10.1109/TMM.2025.3535405

[58] Huang, H., Zhou, X., Cao, J., He, R., Tan, T.: Vision transformer with super token sampling. arXiv:2211.11167 (2022)

[59] Zeng, W., Jin, S., Liu, W., Qian, C., Luo, P., Ouyang, W., Wang, X.: Not all tokens are equal: Human-centric visual analysis via token clustering transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11101–11111 (2022)

[60] Rao, Y., Liu, Z., Zhao, W., Zhou, J., Lu, J.: Dynamic spatial sparsification for efficient vision transformers and convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(9), 10883–10897 (2023) https://doi.org/10.1109/TPAMI.2023.3263826

[61] Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. ArXiv abs/1905.11946 (2019)

[62] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015) https://doi.org/10.1007/s11263-015-0816-y

[63] MMSegmentation Contributors: MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. https://github.com/open-mmlab/mmsegmentation (2020)

[64] Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1209–1218. IEEE Computer Society, Los Alamitos, CA, USA (2018). https://doi.org/10.1109/CVPR.2018.00132 . https://doi.ieeecomputersociety.org/10.1109/CVPR.2018.00132

[65] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

[66] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)