arxiv: 2509.14165 · v1 · pith:3QK3GFHXnew · submitted 2025-09-17 · 💻 cs.CV · cs.AI

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Pith reviewed 2026-05-21 22:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords Vision Transformers · Semantic Segmentation · Token Pruning · Dynamic Patch Merging · Early Exits · Efficient Inference · High-Resolution Images

The pith

STEP merges patches into superpatches via a CNN policy and prunes high-confidence tokens early to reduce Vision Transformer costs in high-resolution segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STEP as a way to make Vision Transformers practical for detailed semantic segmentation on large images by dynamically grouping standard patches into larger superpatches and stopping computation on tokens whose class is already clear. A small CNN decides the merges, while early exits in the encoder layers halt up to 40 percent of tokens before they reach the final layer. The reported result is a 2.5 times token reduction from merging alone and up to a 4 times reduction in computational complexity for the full framework, with corresponding gains in speed, while accuracy drops by no more than 2 percent relative to the baseline model on high-resolution benchmarks.

Core claim

STEP integrates dCTS, a lightweight CNN-based policy network that enables flexible merging of patches into superpatches, together with early-exit blocks that remove high-confidence supertokens from further encoder processing, yielding up to 2.5 times fewer tokens from dCTS alone and 4 times lower computational complexity overall with at most a 2 percent accuracy drop on images up to 1024 by 1024.
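
To put the token figures in concrete terms, the following is a back-of-the-envelope calculation in Python. It assumes a square 1024 by 1024 input split into non-overlapping 16 by 16 patches, and it applies the reported 2.5 times reduction as a flat divisor. The function names are illustrative and do not come from the paper.

```python
def baseline_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


def reduced_tokens(tokens: int, reduction_factor: float) -> int:
    """Token count after applying a reported reduction factor (rounded)."""
    return round(tokens / reduction_factor)


base = baseline_tokens(1024, 1024)   # 64 x 64 grid of 16 x 16 patches
merged = reduced_tokens(base, 2.5)   # reported dCTS-only reduction factor
print(base, merged)
```

Because self-attention cost grows quadratically with the number of tokens while the feed-forward layers grow linearly, a 2.5 times token cut can translate into a compute saving somewhat larger than 2.5 times, which is consistent with the 2.6 times figure reported.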

What carries the argument

dCTS, the lightweight CNN-based policy network that decides flexible merging into superpatches, paired with early-exits on high-confidence supertokens inside the encoder blocks.
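
As a rough illustration of how a merge policy changes the token count, the sketch below assumes that the policy emits one boolean per 2 by 2 group of neighboring patches, and that a merged group becomes a single superpatch token. This page does not describe the exact merge shapes dCTS supports, so the grouping rule, the `count_tokens` helper, and the example policy grid are all hypothetical.

```python
from typing import List


def count_tokens(merge_grid: List[List[bool]], group: int = 2) -> int:
    """Count tokens when each True cell merges a group x group block of
    patches into one superpatch and each False cell keeps its patches."""
    tokens = 0
    for row in merge_grid:
        for merge in row:
            tokens += 1 if merge else group * group
    return tokens


# Hypothetical policy output for a 4 x 4 grid of 2 x 2 patch groups,
# which corresponds to an 8 x 8 patch grid (64 patches in total).
policy = [
    [True,  True,  False, True],
    [True,  False, False, True],
    [True,  True,  True,  True],
    [False, True,  True,  True],
]
total = count_tokens(policy)
print(total, 64 / total)  # token count and reduction factor vs. 64 patches
```

In this toy grid, the four unmerged groups stand in for detailed regions that keep their full patch resolution, while the twelve merged groups stand in for uniform regions that are cheap to represent.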

If this is right

  • dCTS alone cuts token count by a factor of 2.5, computational cost by 2.6 times, and raises throughput 3.4 times on a ViT-Large backbone.
  • The complete STEP framework reaches up to 4 times lower computational complexity and a 1.7 times gain in inference speed.
  • Up to 40 percent of tokens can be halted before the final encoder layer under the reported configurations.
  • The approach is tested on high-resolution semantic segmentation benchmarks with images as large as 1024 by 1024.
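
The early-halting behavior in the list above can be sketched as a confidence test on per-token class scores. This is a minimal sketch that assumes an auxiliary classifier produces logits for each supertoken and that a token is halted when its maximum softmax probability reaches a threshold. The threshold value of 0.95, the helper names, and the example logits are hypothetical, and the paper's actual exit criterion may differ.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def split_by_confidence(
    token_logits: List[List[float]], threshold: float
) -> Tuple[List[int], List[int]]:
    """Return (halted, active) token indices for one early-exit stage."""
    halted, active = [], []
    for i, logits in enumerate(token_logits):
        if max(softmax(logits)) >= threshold:
            halted.append(i)   # confident: skip the remaining encoder blocks
        else:
            active.append(i)   # uncertain: keep processing
    return halted, active


# Hypothetical logits for five supertokens and three classes.
logits = [
    [6.0, 0.5, 0.2],
    [1.0, 0.9, 0.8],
    [0.1, 5.5, 0.3],
    [2.0, 1.8, 0.1],
    [0.2, 0.4, 0.3],
]
halted, active = split_by_confidence(logits, threshold=0.95)
print(halted, active, len(halted) / len(logits))
```

Only the active tokens would be passed to the next encoder block, so the fraction of halted tokens directly determines how much later computation is skipped.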

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same merging-plus-early-exit pattern could be applied to other dense prediction tasks such as depth estimation or instance segmentation to achieve similar compute savings.
  • The locations where tokens exit early may highlight image regions that are easy to classify, suggesting a route to adaptive resolution or region-specific refinement.
  • Integrating STEP with model compression methods like quantization could compound the efficiency gains without additional accuracy cost.

Load-bearing premise

The small CNN policy network can choose merges that preserve necessary detail, and early removal of high-confidence tokens does not discard information that later layers would need for correct segmentation.

What would settle it

Run the full STEP pipeline on a 1024 by 1024 semantic segmentation test set and measure both final mIoU and effective FLOPs; if accuracy falls more than 2 percent or token count does not drop by at least 2 times relative to standard 16 by 16 patching, the efficiency claims are not supported.
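
The decision rule above can be written as a small check. The sketch assumes that the 2 percent bound is read as absolute mIoU points and that token counts are averaged over the test set. The function name and the numbers in the usage lines are hypothetical and are not results from the paper.

```python
def claims_supported(
    baseline_miou: float,
    step_miou: float,
    baseline_tokens: float,
    step_tokens: float,
    max_drop: float = 2.0,
    min_token_factor: float = 2.0,
) -> bool:
    """True if the accuracy drop and the token reduction both meet the bounds."""
    accuracy_drop = baseline_miou - step_miou        # in mIoU points
    token_factor = baseline_tokens / step_tokens     # relative to 16 x 16 patching
    return accuracy_drop <= max_drop and token_factor >= min_token_factor


# Hypothetical measurements, for illustration only.
print(claims_supported(80.0, 78.5, 4096, 1638))  # small drop, large reduction
print(claims_supported(80.0, 77.5, 4096, 1638))  # drop exceeds the bound
print(claims_supported(80.0, 79.0, 4096, 2500))  # reduction below 2 times
```

If the 2 percent bound is instead meant as a relative drop, the `accuracy_drop` line would need to divide by `baseline_miou`. The source text does not settle which reading applies.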

Original abstract

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks integrate also early-exits to remove high-confident supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes STEP, a hybrid token-reduction framework for Vision Transformers in high-resolution semantic segmentation. It introduces dCTS, a lightweight CNN-based policy network for dynamic merging of patches into superpatches, combined with early-exits in encoder blocks to prune high-confidence supertokens. On benchmarks with images up to 1024×1024, it claims that dCTS alone reduces token count by 2.5× (yielding 2.6× compute reduction and 3.4× throughput on ViT-Large), while the full STEP framework achieves up to 4× complexity reduction, 1.7× inference speedup, ≤2% accuracy drop, and early halting of up to 40% of tokens.

Significance. If the results hold and the early-pruning decisions prove safe, the work could provide a practical route to lowering the computational burden of ViT-based high-resolution segmentation. The hybrid design targets both spatial redundancy via flexible merging and per-layer computation via confidence-based exits, which is relevant for resource-limited settings. The reported throughput and complexity gains are substantial enough to matter for real-world deployment if the accuracy bound generalizes.

major comments (1)
  1. The headline claims of 4× complexity reduction and ≤2.0% accuracy drop with up to 40% early halting rest on the assumption that intermediate-layer confidence scores are a sufficient proxy for final-task utility, such that pruned supertokens require no later-layer refinement. This assumption is load-bearing for high-resolution inputs where fine boundary details and long-range context are spatially sparse and often resolved only in deeper self-attention layers. No ablation is described that compares confidence-based early-exits against random or uniform token removal at the same depth and token count to isolate whether the selection avoids disproportionate information loss on harder regions or classes.
minor comments (2)
  1. The abstract reports concrete quantitative results (2.5× token reduction, 2.6× cost reduction, etc.) but does not specify the exact datasets, image resolutions tested, number of runs, or error bars, which would help readers assess the robustness of the efficiency numbers.
  2. The introduction or method section would benefit from an explicit definition and diagram of how dCTS produces flexible superpatches versus standard fixed patching, to clarify the difference from prior token-merging techniques.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will revise the paper accordingly to strengthen the validation of our early-exit strategy.

Point-by-point responses
  1. Referee: The headline claims of 4× complexity reduction and ≤2.0% accuracy drop with up to 40% early halting rest on the assumption that intermediate-layer confidence scores are a sufficient proxy for final-task utility, such that pruned supertokens require no later-layer refinement. This assumption is load-bearing for high-resolution inputs where fine boundary details and long-range context are spatially sparse and often resolved only in deeper self-attention layers. No ablation is described that compares confidence-based early-exits against random or uniform token removal at the same depth and token count to isolate whether the selection avoids disproportionate information loss on harder regions or classes.

    Authors: We agree that a direct ablation against random or uniform token removal would better isolate the benefit of confidence-based selection and address potential concerns about information loss on boundaries or sparse regions in high-resolution inputs. The current manuscript reports overall accuracy and pruning statistics but does not include this specific controlled comparison at matched depths and token counts. In the revised version we will add such an ablation on the high-resolution benchmarks, removing equivalent numbers of tokens randomly or uniformly at the same encoder stages and measuring accuracy drops (overall and per-class/boundary). We expect this to show larger degradation under random/uniform policies, supporting that our confidence proxy is effective.

    revision: yes
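
The matched-count ablation discussed in this exchange could select tokens as in the sketch below. It assumes each token at a given encoder stage has a scalar confidence score and that all three policies remove the same number of tokens, `k`. The helper names, the fixed seed, and the example scores are hypothetical.

```python
import random
from typing import List


def confidence_removal(confidences: List[float], k: int) -> List[int]:
    """Indices of the k most confident tokens."""
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    return sorted(order[:k])


def random_removal(n: int, k: int, seed: int = 0) -> List[int]:
    """k distinct indices drawn uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), k))


def uniform_removal(n: int, k: int) -> List[int]:
    """k indices spread evenly over the token sequence (requires k <= n)."""
    return [(i * n) // k for i in range(k)]


# Hypothetical confidence scores for ten tokens at one encoder stage.
conf = [0.99, 0.40, 0.97, 0.55, 0.91, 0.30, 0.98, 0.60, 0.35, 0.96]
k = 4
by_conf = confidence_removal(conf, k)
by_rand = random_removal(len(conf), k)
by_unif = uniform_removal(len(conf), k)
print(by_conf, by_rand, by_unif)
```

Holding `k` and the encoder stage fixed across the three policies means that any accuracy gap can be attributed to which tokens were removed rather than to how many.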

Circularity Check

0 steps flagged

No circularity: empirical efficiency claims rest on measured outcomes against baselines

full rationale

The paper presents STEP as an empirical engineering framework combining a lightweight CNN policy network (dCTS) for dynamic superpatch merging with early-exit pruning in ViT encoders for high-resolution semantic segmentation. Reported gains (2.5x token reduction, 4x complexity drop, ≤2% accuracy loss, up to 40% early halts) are stated as experimental results on benchmarks up to 1024×1024 images versus the standard 16×16 patching baseline. No equations, fitted parameters, or self-citations are shown that would reduce these quantities to definitions or inputs internal to the paper itself. The derivation chain is therefore self-contained as a set of measured performance deltas rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claims rest on the effectiveness of a newly introduced CNN policy network and early-exit rules whose parameters are learned from data; the paper also assumes the standard 16x16 patching baseline is the relevant comparison point. No independent evidence for the new components is supplied beyond the reported numbers.

free parameters (2)
  • dCTS policy network weights
    Learned parameters of the lightweight CNN that decides patch merging.
  • early-exit confidence threshold
    Threshold used to decide when a supertoken's prediction is confident enough to halt.
axioms (1)
  • domain assumption: Standard 16x16 pixel patching is the appropriate baseline for token count and cost comparisons.
    Explicitly referenced in the abstract as the scheme against which reductions are measured.
invented entities (2)
  • dCTS (dynamic CNN-based token selector) · no independent evidence
    purpose: Lightweight network to enable flexible merging into superpatches.
    New component introduced by the paper.
  • supertokens / superpatches · no independent evidence
    purpose: Merged tokens that reduce overall token count.
    Core representational change proposed in the framework.

pith-pipeline@v0.9.0 · 5776 in / 1646 out tokens · 71211 ms · 2026-05-21T22:22:17.535800+00:00 · methodology
