Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions
Pith reviewed 2026-05-21 22:22 UTC · model grok-4.3
The pith
STEP merges patches into superpatches via a CNN policy and prunes high-confidence tokens early to reduce Vision Transformer costs in high-resolution segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STEP combines dCTS, a lightweight CNN-based policy network that flexibly merges patches into superpatches, with early-exit blocks that remove high-confidence supertokens from further encoder processing. dCTS alone yields up to 2.5 times fewer tokens; the full framework reaches up to 4 times lower computational complexity, with at most a 2 percent accuracy drop, on images up to 1024 by 1024.
What carries the argument
dCTS, the lightweight CNN-based policy network that decides flexible merging into superpatches, paired with early-exits on high-confidence supertokens inside the encoder blocks.
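The merging idea can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's dCTS: the learned CNN policy is replaced by a hand-written grid of scores, one per 2x2 group of base patches, and the function name `plan_superpatches`, the threshold of 0.5, and the restriction to a single 2x2 merge size are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical stand-in for a learned policy: the scores are given directly,
# one "mergeable" score per 2x2 group of base patches.


def plan_superpatches(
    merge_scores: List[List[float]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, int]]:
    """Return token regions as (row, col, size) in base-patch units.

    A group whose score is >= threshold becomes one superpatch (size 2).
    Otherwise its four base patches are kept as separate tokens (size 1).
    """
    regions: List[Tuple[int, int, int]] = []
    for group_row, row in enumerate(merge_scores):
        for group_col, score in enumerate(row):
            r, c = 2 * group_row, 2 * group_col
            if score >= threshold:
                regions.append((r, c, 2))
            else:
                regions.extend(
                    (r + dr, c + dc, 1) for dr in (0, 1) for dc in (0, 1)
                )
    return regions


# A 4x4 grid of base patches is covered by a 2x2 grid of groups.
scores = [[0.9, 0.2], [0.8, 0.7]]
regions = plan_superpatches(scores)
print(len(regions), "tokens instead of", 16)
```

Three of the four groups are merged here, so the sequence shrinks while every base patch is still covered by exactly one token. How the real policy scores regions, and which superpatch sizes it allows, is not specified in the text above.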
If this is right
- dCTS alone cuts token count by a factor of 2.5 relative to standard 16 by 16 patching, lowers computational cost by 2.6 times, and raises throughput by 3.4 times on a ViT-Large backbone.
- The complete STEP framework reaches up to 4 times lower computational complexity and 1.7 times faster inference, with an accuracy drop of at most 2.0 percent.
- Up to 40 percent of tokens can be halted before the final encoder layer under the reported configurations.
- The approach is tested on high-resolution semantic segmentation benchmarks with images as large as 1024 by 1024.
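To put the token figures above in concrete terms, the following back-of-the-envelope sketch counts tokens for a 1024 by 1024 input under 16 by 16 patching and applies the reported 2.5 times reduction. Applying that factor at exactly this resolution is an assumption, since the abstract does not say which resolution the factor was measured at. The helper name `patch_tokens` is hypothetical.

```python
def patch_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)


baseline = patch_tokens(1024, 1024)  # 64 * 64 patches
reduction = 2.5                      # token reduction reported for dCTS alone
after_dcts = baseline / reduction

# Cost terms that grow linearly with token count (projections, MLP) shrink by
# the reduction factor; the pairwise attention term shrinks by its square.
linear_scale = reduction
quadratic_scale = reduction ** 2

print(f"baseline tokens: {baseline}")
print(f"tokens after a 2.5x reduction: {after_dcts:.1f}")
print(f"linear terms shrink {linear_scale}x, quadratic terms {quadratic_scale}x")
```

Read this way, the reported 2.6 times cost reduction lies close to the linear end of that range. This is only a rough reading, not an analysis taken from the paper, and it ignores the cost of the policy network and the decoder.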
Where Pith is reading between the lines
- The same merging-plus-early-exit pattern could be applied to other dense prediction tasks such as depth estimation or instance segmentation to achieve similar compute savings.
- The locations where tokens exit early may highlight image regions that are easy to classify, suggesting a route to adaptive resolution or region-specific refinement.
- Integrating STEP with model compression methods like quantization could compound the efficiency gains without additional accuracy cost.
Load-bearing premise
The premise is twofold: that the small CNN policy network can choose merges that preserve necessary detail, and that removing high-confidence tokens early will not discard information that later layers would need in order to correct the segmentation.
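The early-exit half of this premise can be made concrete with a small sketch. It assumes that confidence is the maximum softmax probability of a per-token auxiliary classifier and that a fixed threshold decides which tokens halt; neither detail is confirmed by the text above, and the function name `early_exit_step` and the threshold of 0.95 are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_step(
    token_logits: Dict[int, List[float]],
    threshold: float = 0.95,
) -> Tuple[Dict[int, int], List[int]]:
    """Split tokens into halted ones (id -> predicted class) and active ids.

    Assumption: a token halts when its top class probability reaches the
    threshold. Its prediction is frozen and later layers never see it.
    """
    halted: Dict[int, int] = {}
    active: List[int] = []
    for token_id, logits in token_logits.items():
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            halted[token_id] = probs.index(confidence)
        else:
            active.append(token_id)
    return halted, active


logits = {0: [8.0, 0.0, 0.0], 1: [1.0, 0.9, 0.8], 2: [0.0, 6.0, 0.5]}
halted, active = early_exit_step(logits)
print("halted:", halted)
print("still active:", active)
```

The risk named in the premise shows up directly in this structure: once a token is placed in `halted`, a confidently wrong prediction can no longer be corrected by deeper layers.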
What would settle it
Run the full STEP pipeline on a 1024 by 1024 semantic segmentation test set and measure both final mIoU and effective FLOPs; if accuracy falls more than 2 percent or token count does not drop by at least 2 times relative to standard 16 by 16 patching, the efficiency claims are not supported.
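The decision rule in this test can be written down as a short check. This is a sketch with assumptions: the 2 percent limit is read as absolute mIoU percentage points, the function name `claims_supported` is hypothetical, and the numbers in the usage lines are made-up placeholders rather than measurements.

```python
def claims_supported(
    baseline_miou: float,
    step_miou: float,
    baseline_tokens: float,
    step_tokens: float,
    max_drop: float = 2.0,
    min_token_ratio: float = 2.0,
) -> bool:
    """Apply the two thresholds from the falsification test.

    Assumption: mIoU values are percentages and the allowed drop is
    measured in absolute percentage points.
    """
    accuracy_drop = baseline_miou - step_miou
    token_ratio = baseline_tokens / step_tokens
    return accuracy_drop <= max_drop and token_ratio >= min_token_ratio


# Placeholder values for illustration only, not results from the paper.
print(claims_supported(50.0, 48.5, 4096, 1638))  # small drop, 2.5x fewer tokens
print(claims_supported(50.0, 47.0, 4096, 1638))  # drop of 3 points
print(claims_supported(50.0, 49.0, 4096, 2500))  # fewer than 2x token savings
```

The test also asks for effective FLOPs to be measured. They are left out of this rule because the stated thresholds refer only to accuracy and token count.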
Original abstract
Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks integrate also early-exits to remove high-confident supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes STEP, a hybrid token-reduction framework for Vision Transformers in high-resolution semantic segmentation. It introduces dCTS, a lightweight CNN-based policy network for dynamic merging of patches into superpatches, combined with early-exits in encoder blocks to prune high-confidence supertokens. On benchmarks with images up to 1024×1024, it claims that dCTS alone reduces token count by 2.5× (yielding 2.6× compute reduction and 3.4× throughput on ViT-Large), while the full STEP framework achieves up to 4× complexity reduction, 1.7× inference speedup, ≤2% accuracy drop, and early halting of up to 40% of tokens.
Significance. If the results hold and the early-pruning decisions prove safe, the work could provide a practical route to lowering the computational burden of ViT-based high-resolution segmentation. The hybrid design targets both spatial redundancy via flexible merging and per-layer computation via confidence-based exits, which is relevant for resource-limited settings. The reported throughput and complexity gains are substantial enough to matter for real-world deployment if the accuracy bound generalizes.
major comments (1)
- The headline claims of 4× complexity reduction and ≤2.0% accuracy drop with up to 40% early halting rest on the assumption that intermediate-layer confidence scores are a sufficient proxy for final-task utility, such that pruned supertokens require no later-layer refinement. This assumption is load-bearing for high-resolution inputs where fine boundary details and long-range context are spatially sparse and often resolved only in deeper self-attention layers. No ablation is described that compares confidence-based early-exits against random or uniform token removal at the same depth and token count to isolate whether the selection avoids disproportionate information loss on harder regions or classes.
minor comments (2)
- The abstract reports concrete quantitative results (2.5× token reduction, 2.6× cost reduction, etc.) but does not specify the exact datasets, image resolutions tested, number of runs, or error bars, which would help readers assess the robustness of the efficiency numbers.
- The introduction or method section would benefit from an explicit definition and diagram of how dCTS produces flexible superpatches versus standard fixed patching, to clarify the difference from prior token-merging techniques.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will revise the paper accordingly to strengthen the validation of our early-exit strategy.
- Referee: The headline claims of 4× complexity reduction and ≤2.0% accuracy drop with up to 40% early halting rest on the assumption that intermediate-layer confidence scores are a sufficient proxy for final-task utility, such that pruned supertokens require no later-layer refinement. This assumption is load-bearing for high-resolution inputs where fine boundary details and long-range context are spatially sparse and often resolved only in deeper self-attention layers. No ablation is described that compares confidence-based early-exits against random or uniform token removal at the same depth and token count to isolate whether the selection avoids disproportionate information loss on harder regions or classes.
Authors: We agree that a direct ablation against random or uniform token removal would better isolate the benefit of confidence-based selection and address potential concerns about information loss on boundaries or sparse regions in high-resolution inputs. The current manuscript reports overall accuracy and pruning statistics but does not include this specific controlled comparison at matched depths and token counts. In the revised version we will add such an ablation on the high-resolution benchmarks, removing equivalent numbers of tokens randomly or uniformly at the same encoder stages and measuring accuracy drops (overall and per-class/boundary). We expect this to show larger degradation under random/uniform policies, supporting that our confidence proxy is effective.
Revision: yes
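The matched-count ablation discussed in this exchange needs three selection policies that remove the same number of tokens at the same encoder stage. A minimal sketch of those selectors follows. It assumes the count `k` is taken from however many tokens the confidence policy halts at that stage; the function names are hypothetical and the confidence values are placeholders.

```python
import random
from typing import List, Sequence


def select_by_confidence(confidences: Sequence[float], k: int) -> List[int]:
    """Indices of the k most confident tokens."""
    order = sorted(
        range(len(confidences)), key=lambda i: confidences[i], reverse=True
    )
    return sorted(order[:k])


def select_random(n: int, k: int, seed: int = 0) -> List[int]:
    """k distinct token indices drawn uniformly at random (seeded)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), k))


def select_uniform(n: int, k: int) -> List[int]:
    """k evenly spaced token indices, distinct whenever k <= n."""
    return [(i * n) // k for i in range(k)]


# Placeholder confidences for eight tokens; k is matched across policies.
confidences = [0.99, 0.40, 0.97, 0.55, 0.91, 0.30, 0.98, 0.60]
k = 3
by_confidence = select_by_confidence(confidences, k)
by_random = select_random(len(confidences), k)
by_uniform = select_uniform(len(confidences), k)
print(by_confidence, by_random, by_uniform)
```

Each index list would then be removed at the same encoder stage and the resulting accuracy compared overall, per class, and near boundaries, as the rebuttal proposes.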
Circularity Check
No circularity: empirical efficiency claims rest on measured outcomes against baselines
The paper presents STEP as an empirical engineering framework combining a lightweight CNN policy network (dCTS) for dynamic superpatch merging with early-exit pruning in ViT encoders for high-resolution semantic segmentation. Reported gains (2.5x token reduction, 4x complexity drop, ≤2% accuracy loss, up to 40% early halts) are stated as experimental results on benchmarks up to 1024×1024 images versus the standard 16×16 patching baseline. No equations, fitted parameters, or self-citations are shown that would reduce these quantities to definitions or inputs internal to the paper itself. The derivation chain is therefore self-contained as a set of measured performance deltas rather than tautological redefinitions.
Axiom & Free-Parameter Ledger
free parameters (2)
- dCTS policy network weights
- early-exit confidence threshold
axioms (1)
- domain assumption: Standard 16x16 pixel patching is the appropriate baseline for token count and cost comparisons.
invented entities (2)
- dCTS (dynamic CNN-based token selector): no independent evidence
- supertokens / superpatches: no independent evidence