When Token Compression Breaks: Structural Pruning vs. Token Reduction for Robust ViT Segmentation under High Compression

Ngai-Man Cheung; Tien-Phat Nguyen

arxiv: 2607.02237 · v1 · pith:SZE2KM4Anew · submitted 2026-07-02 · 💻 cs.CV

Tien-Phat Nguyen, Ngai-Man Cheung

Pith reviewed 2026-07-03 15:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords: token compression · structural pruning · Vision Transformer · semantic segmentation · model compression · robustness · high compression · ADE20K

The pith

Token compression in ViT segmentation works at mild rates but collapses under severe compression, while structural pruning degrades more smoothly and a prune-then-merge pipeline improves the accuracy-robustness trade-off.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks representative token compression and structural pruning methods for Vision Transformer semantic segmentation under matched FLOPs. Experiments on ADE20K, Cityscapes and their common-corruption variants show token compression delivers strong results at low-to-moderate compression but loses accuracy sharply at high rates, consistent with information loss. Structural pruning exhibits smoother performance decline and greater stability at aggressive compression levels. A combined prune-then-merge strategy, applying moderate token compression after moderate pruning, yields better accuracy and robustness at high compression than either approach alone.

Core claim

Token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. A prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone consistently achieves a better accuracy-robustness trade-off at high compression on both clean and corrupted inputs.
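To make the information-loss reading concrete, here is a minimal NumPy sketch of bipartite token merging in the spirit of ToMe [1]. It is an illustration under simplifying assumptions, not the code the paper benchmarks: the function name `bipartite_merge`, the alternating split into two token sets, and the unweighted averaging are all choices made for this example.

```python
import numpy as np


def bipartite_merge(x, r, eps=1e-12):
    """Merge r tokens of an (N, D) array into their most similar partners.

    Hypothetical helper for illustration only. Requires N >= 2.
    Returns an array of shape (N - r, D); token order is not preserved.
    """
    x = np.asarray(x, dtype=float)
    a, b = x[0::2], x[1::2]            # alternating split into sets A and B
    r = min(r, a.shape[0])             # at most every token in A can be merged

    # Cosine similarity between every token in A and every token in B.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    sim = a_n @ b_n.T                  # shape (len(A), len(B))

    best_b = sim.argmax(axis=1)        # best partner in B for each A token
    best_score = sim.max(axis=1)
    order = np.argsort(-best_score)    # most similar pairs first
    merged, kept = order[:r], order[r:]

    # Average each merged A token into its partner; several A tokens may
    # share one partner, so accumulate with unbuffered adds.
    sums = b.copy()
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_b[merged], a[merged])
    np.add.at(counts, best_b[merged], 1.0)
    b_out = sums / counts[:, None]

    return np.concatenate([a[kept], b_out], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))
    print(bipartite_merge(tokens, r=4).shape)
```

Because pairs are merged in order of decreasing similarity, a small `r` only averages near-duplicates, while a large `r` is forced to average tokens that differ substantially. That is one plausible mechanism behind the sharp degradation the paper reports at high compression.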

What carries the argument

Matched-FLOPs comparison of token compression versus structural pruning on corrupted segmentation benchmarks (ADE20K-C, Cityscapes-C), with the prune-then-merge pipeline as the proposed practical combination.
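A back-of-the-envelope cost model shows what "matched FLOPs" means for the two families. The sketch below uses the standard per-block multiply-accumulate count for a plain ViT encoder. It assumes uniform token reduction across all layers and uniform width scaling as a stand-in for structural pruning; the paper's methods are more fine-grained, and the baseline size, the 25% budget, and the hybrid width of 576 are illustrative values, not figures from the paper.

```python
def vit_block_macs(n_tokens, dim, mlp_ratio=4.0):
    """Approximate multiply-accumulates for one ViT encoder block."""
    proj = 4 * n_tokens * dim * dim                    # Q, K, V and output projections
    attn = 2 * n_tokens * n_tokens * dim               # QK^T and attention-weighted V
    mlp = 2 * n_tokens * dim * int(mlp_ratio * dim)    # two MLP linear layers
    return proj + attn + mlp


def encoder_macs(n_tokens, dim, depth=12):
    """Cost of a plain encoder with a constant token count and width."""
    return depth * vit_block_macs(n_tokens, dim)


def largest_under_budget(budget, candidates, cost):
    """Return the largest candidate whose cost stays within the budget."""
    return max(c for c in candidates if cost(c) <= budget)


# Illustrative baseline: 1024 tokens (e.g. a 512x512 input with 16x16 patches)
# and a ViT-B-like width of 768. These numbers are assumptions.
BASE_TOKENS, BASE_DIM, HYBRID_DIM = 1024, 768, 576
budget = 0.25 * encoder_macs(BASE_TOKENS, BASE_DIM)

# Token compression only: keep full width, shrink the token count.
tokens_only = largest_under_budget(
    budget, range(1, BASE_TOKENS + 1), lambda n: encoder_macs(n, BASE_DIM))

# Structural pruning only: keep all tokens, shrink the width.
width_only = largest_under_budget(
    budget, range(8, BASE_DIM + 1, 8), lambda d: encoder_macs(BASE_TOKENS, d))

# Prune-then-merge: moderate width pruning, then as many tokens as fit.
hybrid_tokens = largest_under_budget(
    budget, range(1, BASE_TOKENS + 1), lambda n: encoder_macs(n, HYBRID_DIM))

if __name__ == "__main__":
    print("tokens only :", tokens_only, "tokens at width", BASE_DIM)
    print("pruning only:", BASE_TOKENS, "tokens at width", width_only)
    print("hybrid      :", hybrid_tokens, "tokens at width", HYBRID_DIM)
```

Under this simplified model, the hybrid configuration retains noticeably more tokens than the token-only configuration at the same budget, which is consistent with the paper's motivation for combining moderate amounts of both. Whether FLOPs are counted as MACs or as twice the MACs does not change the matching, since only ratios matter.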

If this is right

Token compression should be restricted to moderate ratios to avoid large accuracy drops on both clean and corrupted data.
Structural pruning provides a more reliable route to extreme efficiency when high compression is required.
Combining moderate pruning with moderate token compression produces a superior accuracy-robustness operating point at high compression.
The relative stability of pruning holds across both clean and corrupted inputs under matched computational budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid efficiency recipes may be worth testing on other dense prediction tasks such as object detection or depth estimation.
Real-world deployment pipelines could incorporate input-quality checks to decide between pruning-heavy and compression-heavy operating modes.
The information-loss interpretation suggests that token-merging heuristics might be redesigned to preserve semantic boundaries under corruption.

Load-bearing premise

The specific representative token-compression and structural-pruning methods, the matched-FLOPs protocol, and the common-corruption variants of ADE20K and Cityscapes are sufficient to reveal the general behavior of the two efficiency approaches under aggressive compression.

What would settle it

A new token compression method that maintains segmentation accuracy and corruption robustness at the highest tested compression ratios on ADE20K-C and Cityscapes-C under the same matched-FLOPs protocol would contradict the observed sharp degradation.

Figures

Figures reproduced from arXiv: 2607.02237 by Ngai-Man Cheung, Tien-Phat Nguyen.

**Figure 2.** Accuracy-compute trade-offs on ADE20K (left) and Cityscapes (right). We compare structural pruning (NViT), token compression methods (ToMe, ALGM, CTS), and the stack pipeline (NViT + ToMe). The top row evaluates clean accuracy (mIoU_clean), while the bottom row evaluates robustness under common corruptions (mIoU_noise). In both datasets, mild token compression can preserve performance, but aggressive token comp…

**Figure 3.** Qualitative comparison under aggressive compression on ADE20K. We show two representative examples with zoomed regions highlighting fine structures and local boundaries. Compared with NViT and prune-then-merge, ALGM produces less spatially coherent predictions in these regions.

**Figure 4.** Normalized effective rank on ADE20K-val (markers and bands show mean ± std over images). Effective-rank analysis. We further analyze feature diversity using entropy-based effective rank [27] as a spectral diagnostic of representation dimensionality and collapse in dense prediction [3]. For each image, we reconstruct the encoder features on the original token grid and denote the resulting feature mat…
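The entropy-based effective rank of Roy and Vetterli [27] mentioned in the Figure 4 caption can be sketched as follows. The definition (exponential of the Shannon entropy of the normalized singular values) follows that reference; the function names and the normalization by `min(N, D)` are assumptions for this example, since the caption is truncated before the paper states its exact normalization.

```python
import numpy as np


def effective_rank(features):
    """Entropy-based effective rank of an (N, D) feature matrix.

    Computes exp(H(p)), where p is the distribution obtained by dividing
    the singular values by their sum. Returns 0.0 for an all-zero matrix.
    """
    s = np.linalg.svd(np.asarray(features, dtype=float), compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]                        # treat 0 * log(0) as 0
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def normalized_effective_rank(features):
    """Effective rank divided by its maximum possible value, min(N, D).

    The choice of normalizer is an assumption made for this sketch.
    """
    features = np.asarray(features, dtype=float)
    return effective_rank(features) / min(features.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    diverse = rng.normal(size=(64, 32))                  # varied token features
    collapsed = np.tile(rng.normal(size=(1, 32)), (64, 1))  # every token identical
    print(normalized_effective_rank(diverse))
    print(normalized_effective_rank(collapsed))
```

A feature map in which many grid positions share the same merged token has fewer independent directions, so its effective rank drops. This is the sense in which the metric serves as a diagnostic of representation collapse.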

Abstract

Vision Transformers (ViTs) are strong backbones for semantic segmentation, but their computational cost limits deployment. Recent token compression methods for efficient transformer-based segmentation reduce this cost by decreasing the number of tokens. However, existing evaluations primarily focus on low-to-moderate compression, leaving their behavior under aggressive compression and corrupted inputs unclear. Meanwhile, structural pruning provides an orthogonal route to efficiency by removing redundant components in the ViT architecture, but is rarely compared to token compression under a unified protocol. To bridge this gap, we benchmark representative token compression and structural pruning methods for ViT-based semantic segmentation under matched FLOPs on ADE20K and Cityscapes, together with their common-corruption variants ADE20K-C and Cityscapes-C. Our results reveal a consistent trend on both clean and corrupted inputs: token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. Motivated by these findings, we study a prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone. At comparable FLOPs, this combined strategy consistently achieves a better accuracy-robustness trade-off at high compression, offering a practical recipe for deployment-oriented ViT segmentation. Code is available at https://github.com/phatnguyencs/vit-seg-compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Token compression drops sharply at high ratios on ViT segmentation while pruning stays steadier, and the prune-then-merge hybrid improves the accuracy-robustness trade-off on corrupted inputs.

The paper's main finding is that token compression works at mild ratios but degrades fast once compression gets aggressive on semantic segmentation, while structural pruning shows a smoother drop and holds up better. The prune-then-merge pipeline then combines moderate versions of both and beats the single-technique baselines at high compression on both clean and corrupted data.

What stands out is the direct matched-FLOPs comparison across ADE20K, Cityscapes, and their corruption variants. Prior work apparently stayed at lower compression levels, so this head-to-head at the aggressive end plus the robustness angle is a useful empirical extension. The trends are reported as consistent across the two datasets, and releasing the code helps anyone who wants to check or extend it.

The soft spots are limited. The chosen representative methods for each family are reasonable for the claim, but someone could still ask whether every token compressor or pruner would behave the same way. The abstract does not mention statistical tests or full ablation tables, so the strength of the curves rests on the observed patterns rather than formal significance. No equations or fitted predictions create circularity issues.

This is for people working on efficient ViT deployment for segmentation, especially when inputs may be noisy or resources are tight. A reader who needs practical guidance on high-compression choices will get something concrete from the benchmark and the hybrid recipe.

It deserves peer review. The protocol is fair, the observation is actionable within the efficiency subfield, and the work is coherent on its own terms.

Referee Report

1 major / 2 minor

Summary. The paper benchmarks representative token compression and structural pruning methods for ViT-based semantic segmentation under a matched-FLOPs protocol on ADE20K, Cityscapes, and their common-corruption variants (ADE20K-C, Cityscapes-C). It reports that token compression performs well at mild compression ratios but degrades sharply at high compression, a pattern the authors describe as consistent with information loss, while structural pruning shows smoother degradation and greater stability. A prune-then-merge pipeline that combines moderate pruning with moderate token compression yields a better accuracy-robustness trade-off at high compression levels.

Significance. If the observed trends hold under the described protocol, the work supplies actionable empirical guidance for deploying efficient ViT segmentation models in resource-constrained settings that also require robustness to input corruptions. The matched-FLOPs comparison, inclusion of corruption benchmarks, and open-sourced code are strengths that support reproducibility and practical utility.

major comments (1)

[Experimental protocol] Experimental protocol (assumed §4): the central claim that the observed trends reveal general behavior of the two efficiency families rests on the representativeness of the selected token-compression and structural-pruning methods; the manuscript should provide explicit justification or sensitivity analysis showing that alternative methods within each family produce qualitatively similar degradation curves, otherwise the generalizability of the prune-then-merge recommendation is weakened.

minor comments (2)

[Abstract] Abstract: the specific token-compression and pruning algorithms chosen as representatives are not named, which reduces immediate clarity about the scope of the benchmark.
[Results] Results presentation: tables or figures reporting the accuracy-robustness trade-off for the prune-then-merge pipeline should include error bars or multiple-run statistics to confirm that the reported gains are stable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's practical utility. We address the single major comment below and will incorporate the suggested clarification in the revision.

Referee: [Experimental protocol] Experimental protocol (assumed §4): the central claim that the observed trends reveal general behavior of the two efficiency families rests on the representativeness of the selected token-compression and structural-pruning methods; the manuscript should provide explicit justification or sensitivity analysis showing that alternative methods within each family produce qualitatively similar degradation curves, otherwise the generalizability of the prune-then-merge recommendation is weakened.

Authors: We agree that explicit justification strengthens the generalizability claim. In the revised manuscript we will expand the experimental protocol section with a new paragraph that (i) motivates the selected representatives by their prevalence in the literature and coverage of core mechanisms (token merging/pruning for compression; head/layer/channel removal for structural pruning), and (ii) cites prior studies reporting qualitatively similar sharp vs. smooth degradation curves under high compression for other methods in each family. This addition directly addresses the concern without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain

The paper is a standard empirical benchmark study comparing token compression and structural pruning methods for ViT-based semantic segmentation. It reports observed trends from experiments under a matched-FLOPs protocol on ADE20K, Cityscapes and their corruption variants, then proposes a prune-then-merge pipeline as a practical outcome of those observations. No equations, fitted parameters, predictions, uniqueness theorems, or self-citation load-bearing steps appear in the abstract or described design. The central claims rest on external experimental results rather than any reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central empirical claims rest on standard assumptions about ViT architectures, FLOPs as a proxy for efficiency, and the representativeness of the chosen methods and corruption benchmarks; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

Domain assumption: Standard ViT backbones, semantic-segmentation heads, and evaluation protocols on ADE20K/Cityscapes remain valid when compression is applied.
The paper invokes these established components without additional justification.

pith-pipeline@v0.9.1-grok · 5790 in / 1351 out tokens · 27728 ms · 2026-07-03T15:49:51.190238+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 8 canonical work pages

[1] Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Representations (2023)

[2] Cai, H., Li, J., Hu, M., Gan, C., Han, S.: Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17302–17313 (2023)

[3] Chen, L., Gu, L., Fu, Y.: Frequency-dynamic attention modulation for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

[4] Chen, M., Shao, W., Xu, P., Lin, M., Zhang, K., Chao, F., Ji, R., Qiao, Y., Luo, P.: Diffrate: Differentiable compression rate for efficient vision transformers. arXiv preprint arXiv:2305.17997 (2023)

[5] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)

[6] Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS (2021)

[7] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding (2016)

[8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021)

[9] Fang, G., Ma, X., Song, M., Mi, M.B., Wang, X.: Depgraph: Towards any structural pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16091–16101 (2023)

[10] Fang, G., Ma, X.T., Mi, M.B., Wang, X.: Isomorphic pruning for vision models. arXiv preprint arXiv:2407.04616 (2024)

[11] Fayyaz, M., Abbasi Kouhpayegani, S., Rezaei Jafari, F., Sommerlade, E., Vaezi Joze, H.R., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vision transformers. European Conference on Computer Vision (ECCV) (2022)

[12] Haurum, J.B., Escalera, S., Taylor, G.W., Moeslund, T.B.: Which tokens to use? investigating token reduction in vision transformers. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 773–783 (2023). https://doi.org/10.1109/ICCVW60793.2023.00085

[13] Hooker, S., Courville, A., Clark, G., Dauphin, Y., Frome, A.: What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248 (2019)

[14] Hou, Z., Kung, S.Y.: Multi-dimensional vision transformer compression via dependency guided gaussian process search. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 3668–3677 (2022). https://doi.org/10.1109/CVPRW56347.2022.00411

[15] Huang, H., Campello, R.J.G.B., Erfani, S.M., Ma, X., Houle, M.E., Bailey, J.: Ldreg: Local dimensionality regularized self-supervised learning. In: ICLR (2024)

[16] Jain, J., Li, J., Chiu, M., Hassani, A., Orlov, N., Shi, H.: OneFormer: One Transformer to Rule Universal Image Segmentation. In: CVPR (2023)

[17] Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8825–8835. IEEE (Jun 2020). https://doi.org/10.1109/cvpr42600.2020.00885

[18] Kim, M., Gao, S., Hsu, Y.C., Shen, Y., Jin, H.: Token fusion: Bridging the gap between token pruning and token merging. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1372–1381 (2024). https://doi.org/10.1109/WACV57701.2024.00141

[19] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023)

[20] Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al.: Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. pp. 620–640. Springer (2022)

[21] Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Not all patches are what you need: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (2022)

[22] Liu, Y., Zhou, Q., Wang, J., Wang, Z., Wang, F., Wang, J., Zhang, W.: Dynamic token-pass transformers for semantic segmentation. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1816–1825 (2024). https://doi.org/10.1109/WACV57701.2024.00184

[23] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

[24] Lu, C., de Geus, D., Dubbelman, G.: Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

[25] Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European conference on computer vision (ECCV). pp. 116–131 (2018)

[26] Norouzi, N., Sorlova, S., de Geus, D., Dubbelman, G.: ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

[27] Roy, O., Vetterli, M.: The effective rank: A measure of effective dimensionality. In: 2007 15th European signal processing conference. pp. 606–610. IEEE (2007)

[28] Shen, M., Yin, H., Molchanov, P., Mao, L., Liu, J., Alvarez, J.: Structural pruning via latency-saliency knapsack. In: Advances in Neural Information Processing Systems (2022)

[29] Shi, D., Tao, C., Jin, Y., Yang, Z., Yuan, C., Wang, J.: UPop: Unified and progressive pruning for compressing vision-language transformers. In: Proceedings of the 40th International Conference on Machine Learning. vol. 202, pp. 31292–31311. PMLR (2023)

[30] Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation (2021)

[31] Tang, Q., Zhang, B., Liu, J., Liu, F., Liu, Y.: Dynamic token pruning in plain vision transformers for semantic segmentation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 777–786 (2023)

[32] Tang, Y., Wang, Y., Guo, J., Tu, Z., Han, K., Hu, H., Tao, D.: A survey on transformer compression (2024)

[33] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021)

[34] Wang, Z., Luo, H., Wang, P., Ding, F., Wang, F., Li, H.: VTC-LFC: Vision transformer compression with low-frequency components. In: Thirty-Sixth Conference on Neural Information Processing Systems (2022)

[35] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: Neural Information Processing Systems (NeurIPS) (2021)

[36] Yang, H., Yin, H., Shen, M., Molchanov, P., Li, H., Kautz, J.: Global vision transformer pruning with hessian-aware saliency. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18547–18557 (June 2023)

[37] Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., Yu, G., Shen, C.: Topformer: Token pyramid transformer for mobile semantic segmentation. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2022)

[38] Zheng, C., Zhang, K., Yang, Z., Tan, W., Xiao, J., Ren, Y., Pu, S., et al.: Savit: Structure-aware vision transformer pruning via collaborative optimization. Advances in Neural Information Processing Systems 35, 9010–9023 (2022)

[39] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)

[40] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)

[41] Zhuo, Z., Wang, Y., Ma, J., Wang, Y.: Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. In: International Conference on Learning Representations (2023)

[1] [1]

In: International Conference on Learning Represen- tations (2023)

Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Represen- tations (2023)

2023

[2] [2]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Cai, H., Li, J., Hu, M., Gan, C., Han, S.: Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17302–17313 (2023)

2023

[3] [3]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

Chen, L., Gu, L., Fu, Y.: Frequency-dynamic attention modulation for dense pre- diction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

2025

[4] [4]

arXiv preprint arXiv:2305.17997 (2023)

Chen, M., Shao, W., Xu, P., Lin, M., Zhang, K., Chao, F., Ji, R., Qiao, Y., Luo, P.: Diffrate: Differentiable compression rate for efficient vision transformers. arXiv preprint arXiv:2305.17997 (2023)

work page arXiv 2023

[5] [5]

In: CVPR (2022)

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)

2022

[6] [6]

In: NeurIPS (2021)

Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS (2021)

2021

[7] [7]

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding (2016)

2016

[8] [8]

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021)

2021

[9] [9]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Fang,G.,Ma,X.,Song,M.,Mi,M.B.,Wang,X.:Depgraph:Towardsanystructural pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16091–16101 (2023) Token Compression vs. Structural Pruning for ViT Segmentation 15

2023

[10] [10]

arXiv preprint arXiv:2407.04616 (2024)

Fang, G., Ma, X.T., Mi, M.B., Wang, X.: Isomorphic pruning for vision models. arXiv preprint arXiv:2407.04616 (2024)

work page arXiv 2024

[11] [11]

European Conference on Computer Vision (ECCV) (2022)

Fayyaz, M., Abbasi Kouhpayegani, S., Rezaei Jafari, F., Sommerlade, E., Vaezi Joze, H.R., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vision transformers. European Conference on Computer Vision (ECCV) (2022)

2022

[12] [12]

In: 2023 IEEE/CVF Interna- tional Conference on Computer Vision Workshops (ICCVW)

Haurum, J.B., Escalera, S., Taylor, G.W., Moeslund, T.B.: Which tokens to use? investigating token reduction in vision transformers. In: 2023 IEEE/CVF Interna- tional Conference on Computer Vision Workshops (ICCVW). pp. 773–783 (2023). https://doi.org/10.1109/ICCVW60793.2023.00085

work page doi:10.1109/iccvw60793.2023.00085 2023

[13] [13]

Hooker, S., Courville, A., Clark, G., Dauphin, Y., Frome, A.: What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248 (2019)

work page arXiv 1911

[14] [14]

In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Hou, Z., Kung, S.Y.: Multi-dimensional vision transformer compression via de- pendency guided gaussian process search. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 3668–3677 (2022).https://doi.org/10.1109/CVPRW56347.2022.00411

work page doi:10.1109/cvprw56347.2022.00411 2022

[15] [15]

In: ICLR (2024)

Huang, H., Campello, R.J.G.B., Erfani, S.M., Ma, X., Houle, M.E., Bailey, J.: Ldreg: Local dimensionality regularized self-supervised learning. In: ICLR (2024)

2024

[16] [16]

In: CVPR (2023)

Jain, J., Li, J., Chiu, M., Hassani, A., Orlov, N., Shi, H.: OneFormer: One Trans- former to Rule Universal Image Segmentation. In: CVPR (2023)

2023

[17] [17]

In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR). p. 8825–8835. IEEE (Jun 2020).https://doi.org/10.1109/ cvpr42600.2020.00885

work page arXiv 2020

[18] [18]

In: IEEE Winter Conf

Kim, M., Gao, S., Hsu, Y.C., Shen, Y., Jin, H.: Token fusion: Bridging the gap between token pruning and token merging. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1372–1381 (2024).https:// doi.org/10.1109/WACV57701.2024.00141

work page doi:10.1109/wacv57701.2024.00141 2024

[19] [19]

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023)

2023

[20] [20]

In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI

Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al.: Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. pp. 620–640. Springer (2022)

2022

[21] [21]

In: International Conference on Learning Representations (2022)

Liang,Y.,Ge,C.,Tong,Z.,Song,Y.,Wang,J.,Xie,P.:Notallpatchesarewhatyou need: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (2022)

2022

[22] [22]

In: IEEE Winter Conf

Liu, Y., Zhou, Q., Wang, J., Wang, Z., Wang, F., Wang, J., Zhang, W.: Dynamic token-pass transformers for semantic segmentation. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1816–1825 (2024). https://doi.org/10.1109/WACV57701.2024.00184

work page doi:10.1109/wacv57701.2024.00184 2024

[23] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

[24] Lu, C., de Geus, D., Dubbelman, G.: Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

[25] Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European conference on computer vision (ECCV). pp. 116–131 (2018)

[26] Norouzi, N., Sorlova, S., de Geus, D., Dubbelman, G.: ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

[27] Roy, O., Vetterli, M.: The effective rank: A measure of effective dimensionality. In: 2007 15th European signal processing conference. pp. 606–610. IEEE (2007)

[28] Shen, M., Yin, H., Molchanov, P., Mao, L., Liu, J., Alvarez, J.: Structural pruning via latency-saliency knapsack. In: Advances in Neural Information Processing Systems (2022)

[29] Shi, D., Tao, C., Jin, Y., Yang, Z., Yuan, C., Wang, J.: UPop: Unified and progressive pruning for compressing vision-language transformers. In: Proceedings of the 40th International Conference on Machine Learning. vol. 202, pp. 31292–31311. PMLR (2023)

[30] Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation (2021)

[31] Tang, Q., Zhang, B., Liu, J., Liu, F., Liu, Y.: Dynamic token pruning in plain vision transformers for semantic segmentation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 777–786 (2023)

[32] Tang, Y., Wang, Y., Guo, J., Tu, Z., Han, K., Hu, H., Tao, D.: A survey on transformer compression (2024)

[33] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021)

[34] Wang, Z., Luo, H., Wang, P., Ding, F., Wang, F., Li, H.: VTC-LFC: Vision transformer compression with low-frequency components. In: Thirty-Sixth Conference on Neural Information Processing Systems (2022)

[35] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: Neural Information Processing Systems (NeurIPS) (2021)

[36] Yang, H., Yin, H., Shen, M., Molchanov, P., Li, H., Kautz, J.: Global vision transformer pruning with hessian-aware saliency. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18547–18557 (June 2023)

[37] Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., Yu, G., Shen, C.: Topformer: Token pyramid transformer for mobile semantic segmentation. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2022)

[38] Zheng, C., Zhang, K., Yang, Z., Tan, W., Xiao, J., Ren, Y., Pu, S., et al.: Savit: Structure-aware vision transformer pruning via collaborative optimization. Advances in Neural Information Processing Systems 35, 9010–9023 (2022)

[39] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)

[40] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)

[41] Zhuo, Z., Wang, Y., Ma, J., Wang, Y.: Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. In: International Conference on Learning Representations (2023)