
arxiv: 2607.02237 · v1 · pith:SZE2KM4Anew · submitted 2026-07-02 · 💻 cs.CV

When Token Compression Breaks: Structural Pruning vs. Token Reduction for Robust ViT Segmentation under High Compression

Pith reviewed 2026-07-03 15:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords token compression · structural pruning · Vision Transformer · semantic segmentation · model compression · robustness · high compression · ADE20K

The pith

Token compression in ViT segmentation works at mild rates but collapses under severe compression, while structural pruning degrades more smoothly and a prune-then-merge pipeline improves the accuracy-robustness trade-off.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks representative token compression and structural pruning methods for Vision Transformer semantic segmentation under matched FLOPs. Experiments on ADE20K, Cityscapes and their common-corruption variants show token compression delivers strong results at low-to-moderate compression but loses accuracy sharply at high rates, consistent with information loss. Structural pruning exhibits smoother performance decline and greater stability at aggressive compression levels. A combined prune-then-merge strategy, applying moderate token compression after moderate pruning, yields better accuracy and robustness at high compression than either approach alone.

Core claim

Token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. A prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone consistently achieves a better accuracy-robustness trade-off at high compression on both clean and corrupted inputs.
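
To make the token-reduction mechanism concrete, here is a minimal NumPy sketch of a simplified bipartite token merge in the spirit of ToMe [1]. It is an illustration under stated assumptions, not the benchmarked implementation: the function name `merge_tokens`, the alternating split into two sets, and plain averaging of matched tokens are all choices made for this example.

```python
import numpy as np


def merge_tokens(x, r):
    """Merge r tokens of x (shape [n, d], n >= 2) into their nearest partners.

    Hypothetical simplification: even-indexed tokens form set A, odd-indexed
    tokens form set B, and the r tokens of A with the highest cosine similarity
    to some token of B are averaged into that token.
    """
    a, b = x[0::2], x[1::2]
    r = min(r, a.shape[0])
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T                      # cosine similarity, shape [|A|, |B|]
    best_b = sim.argmax(axis=1)            # most similar B token for each A token
    order = np.argsort(-sim.max(axis=1))   # A tokens, most mergeable first
    merged_idx, kept_idx = order[:r], order[r:]

    sums = b.astype(float)                 # astype returns a copy, b is untouched
    counts = np.ones(b.shape[0])
    np.add.at(sums, best_b[merged_idx], a[merged_idx])  # accumulate merged tokens
    np.add.at(counts, best_b[merged_idx], 1.0)
    b_out = sums / counts[:, None]         # average of each merged group
    return np.concatenate([a[kept_idx], b_out], axis=0)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
print(merge_tokens(tokens, 4).shape)       # 4 of the 16 tokens are merged away
```

Each call removes at most half of the tokens, and the merged output no longer sits on the original patch grid. A segmentation model therefore has to map the reduced set back to the full grid before predicting per-pixel labels, which is one place where aggressive reduction can discard spatial detail.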

What carries the argument

Matched-FLOPs comparison of token compression versus structural pruning on corrupted segmentation benchmarks (ADE20K-C, Cityscapes-C), with the prune-then-merge pipeline as the proposed practical combination.
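
A rough sketch of what "matched FLOPs" means in practice: the snippet below uses the standard back-of-the-envelope cost model for a plain ViT encoder and asks how many tokens a full-width model could afford at the budget of a width-pruned one. The function names, the 1024-token setup, and the use of a halved MLP width as a stand-in for structural pruning are assumptions for illustration. The paper's actual protocol and measurement tool are not described in this summary.

```python
def vit_encoder_macs(num_tokens, dim, depth, mlp_ratio=4.0):
    """Approximate multiply-accumulates (MACs) of a plain ViT encoder."""
    hidden = int(dim * mlp_ratio)
    per_layer = (
        4 * num_tokens * dim * dim             # Q, K, V and output projections
        + 2 * num_tokens * num_tokens * dim    # QK^T scores and attention-weighted sum
        + 2 * num_tokens * dim * hidden        # the two MLP linear layers
    )
    return depth * per_layer


def tokens_for_budget(budget_macs, dim, depth, mlp_ratio=4.0, max_tokens=4096):
    """Largest token count whose cost fits the budget (cost is monotone in tokens)."""
    lo, hi = 0, max_tokens
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if vit_encoder_macs(mid, dim, depth, mlp_ratio) <= budget_macs:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Hypothetical setup: 512x512 input, 16x16 patches -> 32 * 32 = 1024 tokens.
baseline = vit_encoder_macs(1024, dim=768, depth=12)
# "Pruned" model: same tokens, MLP hidden width halved (stand-in for structural pruning).
pruned = vit_encoder_macs(1024, dim=768, depth=12, mlp_ratio=2.0)
# Token count a full-width model can afford at the pruned model's budget.
matched_tokens = tokens_for_budget(pruned, dim=768, depth=12)
print(f"pruned/baseline cost: {pruned / baseline:.2f}")
print(f"matched token count:  {matched_tokens} of 1024")
```

The model assumes a constant token count in every layer, whereas token compression methods typically reduce tokens progressively through the network. It also ignores the segmentation head, so it gives only a first-order view of how the two approaches trade off against the same budget.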

If this is right

  • Token compression should be restricted to moderate ratios to avoid large accuracy drops on both clean and corrupted data.
  • Structural pruning provides a more reliable route to extreme efficiency when high compression is required.
  • Combining moderate pruning with moderate token compression produces a superior accuracy-robustness operating point at high compression.
  • The relative stability of pruning holds across both clean and corrupted inputs under matched computational budgets.
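
For contrast with token reduction, the following sketch shows the general shape of structural pruning on one MLP block: whole hidden channels are removed, so the weight matrices shrink while the number of tokens stays the same. The magnitude-based score, the ReLU activation, and the function names are assumptions chosen to keep the example short. They are not the saliency criterion or architecture details of NViT or any other benchmarked method.

```python
import numpy as np


def prune_mlp_channels(w1, b1, w2, keep_ratio):
    """Keep the highest-scoring hidden channels of a two-layer MLP.

    w1: [d, h], b1: [h], w2: [h, d]. The score is a simple weight-norm
    product, used here only as a placeholder importance measure.
    """
    h = w1.shape[1]
    k = max(1, int(round(h * keep_ratio)))
    score = np.linalg.norm(w1, axis=0) * np.linalg.norm(w2, axis=1)
    keep = np.sort(np.argsort(-score)[:k])   # indices of surviving channels
    return w1[:, keep], b1[keep], w2[keep, :]


def mlp(x, w1, b1, w2):
    """Two-layer MLP with ReLU (ViTs usually use GELU; ReLU keeps this short)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2


rng = np.random.default_rng(0)
w1, b1, w2 = rng.normal(size=(6, 8)), rng.normal(size=8), rng.normal(size=(8, 6))
x = rng.normal(size=(5, 6))                  # 5 tokens, all of them survive pruning
pw1, pb1, pw2 = prune_mlp_channels(w1, b1, w2, keep_ratio=0.5)
print(mlp(x, pw1, pb1, pw2).shape)           # token count and output width unchanged
```

Because every token still passes through the pruned layers, the spatial layout of the features is preserved. In practice, pruned models are usually fine-tuned afterwards to recover accuracy.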

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid efficiency recipes may be worth testing on other dense prediction tasks such as object detection or depth estimation.
  • Real-world deployment pipelines could incorporate input-quality checks to decide between pruning-heavy and compression-heavy operating modes.
  • The information-loss interpretation suggests that token-merging heuristics might be redesigned to preserve semantic boundaries under corruption.

Load-bearing premise

The specific representative token-compression and structural-pruning methods, the matched-FLOPs protocol, and the common-corruption variants of ADE20K and Cityscapes are sufficient to reveal the general behavior of the two efficiency approaches under aggressive compression.

What would settle it

A new token compression method that maintains segmentation accuracy and corruption robustness at the highest tested compression ratios on ADE20K-C and Cityscapes-C under the same matched-FLOPs protocol would contradict the observed sharp degradation.

Figures

Figures reproduced from arXiv: 2607.02237 by Ngai-Man Cheung, Tien-Phat Nguyen.

Figure 1: Compression strategies and their robustness-compute trade-off.
Figure 2: Accuracy-compute trade-offs on ADE20K (left) and Cityscapes (right). We compare structural pruning (NViT), token compression methods (ToMe, ALGM, CTS), and the stack pipeline (NViT + ToMe). The top row evaluates clean accuracy mIoU_clean, while the bottom row evaluates robustness under common corruptions mIoU_noise. In both datasets, mild token compression can preserve performance, but aggressive token comp…
Figure 3: Qualitative comparison under aggressive compression on ADE20K. We show two representative examples with zoomed regions highlighting fine structures and local boundaries. Compared with NViT and prune-then-merge, ALGM produces less spatially coherent predictions in these regions.
Figure 4: Normalized effective rank on ADE20K-val (markers and bands show mean ± std over images). Effective-rank analysis. We further analyze feature diversity using entropy-based effective rank [27] as a spectral diagnostic of representation dimensionality and collapse in dense prediction [3]. For each image, we reconstruct the encoder features on the original token grid and denote the resulting feature mat…
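
The entropy-based effective rank of Roy and Vetterli [27] can be sketched in a few lines: normalize the singular values into a distribution and exponentiate its Shannon entropy. The normalization by `min(features.shape)` below is an assumption, since the excerpt above is truncated before the paper defines its normalized variant. Any preprocessing the authors apply, such as centering, is likewise not shown here.

```python
import numpy as np


def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a [num_tokens, dim] feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)            # singular values as a probability distribution
    p = p[p > eps]                     # drop numerically zero entries before the log
    entropy = -(p * np.log(p)).sum()   # Shannon entropy in nats
    return float(np.exp(entropy))


def normalized_effective_rank(features):
    """Assumed normalization: divide by the maximum possible rank."""
    return effective_rank(features) / min(features.shape)


rng = np.random.default_rng(0)
diverse = rng.normal(size=(64, 32))                       # roughly full-rank features
collapsed = np.outer(rng.normal(size=64), rng.normal(size=32))  # rank-1 features
print(normalized_effective_rank(diverse), normalized_effective_rank(collapsed))
```

A value near 1 means the features spread evenly over all available directions. A value near `1 / min(features.shape)` means they have collapsed onto a single direction, the kind of loss of diversity this diagnostic is meant to expose.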
Original abstract

Vision Transformers (ViTs) are strong backbones for semantic segmentation, but their computational cost limits deployment. Recent token compression methods for efficient transformer-based segmentation reduce this cost by decreasing the number of tokens. However, existing evaluations primarily focus on low-to-moderate compression, leaving their behavior under aggressive compression and corrupted inputs unclear. Meanwhile, structural pruning provides an orthogonal route to efficiency by removing redundant components in the ViT architecture, but is rarely compared to token compression under a unified protocol. To bridge this gap, we benchmark representative token compression and structural pruning methods for ViT-based semantic segmentation under matched FLOPs on ADE20K and Cityscapes, together with their common-corruption variants ADE20K-C and Cityscapes-C. Our results reveal a consistent trend on both clean and corrupted inputs: token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. Motivated by these findings, we study a prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone. At comparable FLOPs, this combined strategy consistently achieves a better accuracy-robustness trade-off at high compression, offering a practical recipe for deployment-oriented ViT segmentation. Code is available at https://github.com/phatnguyencs/vit-seg-compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper benchmarks representative token compression and structural pruning methods for ViT-based semantic segmentation under a matched-FLOPs protocol on ADE20K, Cityscapes, and their common-corruption variants (ADE20K-C, Cityscapes-C). It reports that token compression performs well at mild compression ratios but degrades sharply at high compression due to information loss, while structural pruning shows smoother degradation and greater stability; a prune-then-merge pipeline combining moderate pruning with token compression yields a superior accuracy-robustness trade-off at high compression levels.

Significance. If the observed trends hold under the described protocol, the work supplies actionable empirical guidance for deploying efficient ViT segmentation models in resource-constrained settings that also require robustness to input corruptions. The matched-FLOPs comparison, inclusion of corruption benchmarks, and open-sourced code are strengths that support reproducibility and practical utility.

major comments (1)
  1. [Experimental protocol] Experimental protocol (assumed §4): the central claim that the observed trends reveal general behavior of the two efficiency families rests on the representativeness of the selected token-compression and structural-pruning methods; the manuscript should provide explicit justification or sensitivity analysis showing that alternative methods within each family produce qualitatively similar degradation curves, otherwise the generalizability of the prune-then-merge recommendation is weakened.
minor comments (2)
  1. [Abstract] Abstract: the specific token-compression and pruning algorithms chosen as representatives are not named, which reduces immediate clarity about the scope of the benchmark.
  2. [Results] Results presentation: tables or figures reporting the accuracy-robustness trade-off for the prune-then-merge pipeline should include error bars or multiple-run statistics to confirm that the reported gains are stable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's practical utility. We address the single major comment below and will incorporate the suggested clarification in the revision.

  1. Referee: [Experimental protocol] Experimental protocol (assumed §4): the central claim that the observed trends reveal general behavior of the two efficiency families rests on the representativeness of the selected token-compression and structural-pruning methods; the manuscript should provide explicit justification or sensitivity analysis showing that alternative methods within each family produce qualitatively similar degradation curves, otherwise the generalizability of the prune-then-merge recommendation is weakened.

    Authors: We agree that explicit justification strengthens the generalizability claim. In the revised manuscript we will expand the experimental protocol section with a new paragraph that (i) motivates the selected representatives by their prevalence in the literature and coverage of core mechanisms (token merging/pruning for compression; head/layer/channel removal for structural pruning), and (ii) cites prior studies reporting qualitatively similar sharp vs. smooth degradation curves under high compression for other methods in each family. This addition directly addresses the concern without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain


The paper is a standard empirical benchmark study comparing token compression and structural pruning methods for ViT-based semantic segmentation. It reports observed trends from experiments under a matched-FLOPs protocol on ADE20K, Cityscapes and their corruption variants, then proposes a prune-then-merge pipeline as a practical outcome of those observations. No equations, fitted parameters, predictions, uniqueness theorems, or self-citation load-bearing steps appear in the abstract or described design. The central claims rest on external experimental results rather than any reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central empirical claims rest on standard assumptions about ViT architectures, FLOPs as a proxy for efficiency, and the representativeness of the chosen methods and corruption benchmarks; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: Standard ViT backbones, semantic-segmentation heads, and evaluation protocols on ADE20K/Cityscapes remain valid when compression is applied.
    The paper invokes these established components without additional justification.

pith-pipeline@v0.9.1-grok · 5790 in / 1351 out tokens · 27728 ms · 2026-07-03T15:49:51.190238+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 8 canonical work pages

  1. [1]

    Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Representations (2023)

  2. [2]

    Cai, H., Li, J., Hu, M., Gan, C., Han, S.: Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17302–17313 (2023)

  3. [3]

    Chen, L., Gu, L., Fu, Y.: Frequency-dynamic attention modulation for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2025)

  4. [4]

    Chen, M., Shao, W., Xu, P., Lin, M., Zhang, K., Chao, F., Ji, R., Qiao, Y., Luo, P.: Diffrate: Differentiable compression rate for efficient vision transformers. arXiv preprint arXiv:2305.17997 (2023)

  5. [5]

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)

  6. [6]

    Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS (2021)

  7. [7]

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding (2016)

  8. [8]

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021)

  9. [9]

    Fang, G., Ma, X., Song, M., Mi, M.B., Wang, X.: Depgraph: Towards any structural pruning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16091–16101 (2023)

  10. [10]

    Fang, G., Ma, X.T., Mi, M.B., Wang, X.: Isomorphic pruning for vision models. arXiv preprint arXiv:2407.04616 (2024)

  11. [11]

    Fayyaz, M., Abbasi Kouhpayegani, S., Rezaei Jafari, F., Sommerlade, E., Vaezi Joze, H.R., Pirsiavash, H., Gall, J.: Adaptive token sampling for efficient vision transformers. In: European Conference on Computer Vision (ECCV) (2022)

  12. [12]

    Haurum, J.B., Escalera, S., Taylor, G.W., Moeslund, T.B.: Which tokens to use? Investigating token reduction in vision transformers. In: 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). pp. 773–783 (2023). https://doi.org/10.1109/ICCVW60793.2023.00085

  13. [13]

    Hooker, S., Courville, A., Clark, G., Dauphin, Y., Frome, A.: What do compressed deep neural networks forget? arXiv preprint arXiv:1911.05248 (2019)

  14. [14]

    Hou, Z., Kung, S.Y.: Multi-dimensional vision transformer compression via dependency guided gaussian process search. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 3668–3677 (2022). https://doi.org/10.1109/CVPRW56347.2022.00411

  15. [15]

    Huang, H., Campello, R.J.G.B., Erfani, S.M., Ma, X., Houle, M.E., Bailey, J.: Ldreg: Local dimensionality regularized self-supervised learning. In: ICLR (2024)

  16. [16]

    Jain, J., Li, J., Chiu, M., Hassani, A., Orlov, N., Shi, H.: OneFormer: One Transformer to Rule Universal Image Segmentation. In: CVPR (2023)

  17. [17]

    Kamann, C., Rother, C.: Benchmarking the robustness of semantic segmentation models. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8825–8835. IEEE (Jun 2020). https://doi.org/10.1109/cvpr42600.2020.00885

  18. [18]

    Kim, M., Gao, S., Hsu, Y.C., Shen, Y., Jin, H.: Token fusion: Bridging the gap between token pruning and token merging. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1372–1381 (2024). https://doi.org/10.1109/WACV57701.2024.00141

  19. [19]

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything (2023)

  20. [20]

    Kong, Z., Dong, P., Ma, X., Meng, X., Niu, W., Sun, M., Shen, X., Yuan, G., Ren, B., Tang, H., et al.: Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI. pp. 620–640. Springer (2022)

  21. [21]

    Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Not all patches are what you need: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (2022)

  22. [22]

    Liu, Y., Zhou, Q., Wang, J., Wang, Z., Wang, F., Wang, J., Zhang, W.: Dynamic token-pass transformers for semantic segmentation. In: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1816–1825 (2024). https://doi.org/10.1109/WACV57701.2024.00184

  23. [23]

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  24. [24]

    Lu, C., de Geus, D., Dubbelman, G.: Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023)

  25. [25]

    Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for efficient cnn architecture design. In: Proceedings of the European conference on computer vision (ECCV). pp. 116–131 (2018)

  26. [26]

    Norouzi, N., Sorlova, S., de Geus, D., Dubbelman, G.: ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  27. [27]

    Roy, O., Vetterli, M.: The effective rank: A measure of effective dimensionality. In: 2007 15th European signal processing conference. pp. 606–610. IEEE (2007)

  28. [28]

    Shen, M., Yin, H., Molchanov, P., Mao, L., Liu, J., Alvarez, J.: Structural pruning via latency-saliency knapsack. In: Advances in Neural Information Processing Systems (2022)

  29. [29]

    Shi, D., Tao, C., Jin, Y., Yang, Z., Yuan, C., Wang, J.: UPop: Unified and progressive pruning for compressing vision-language transformers. In: Proceedings of the 40th International Conference on Machine Learning. vol. 202, pp. 31292–31311. PMLR (2023)

  30. [30]

    Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation (2021)

  31. [31]

    Tang, Q., Zhang, B., Liu, J., Liu, F., Liu, Y.: Dynamic token pruning in plain vision transformers for semantic segmentation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 777–786 (2023)

  32. [32]

    Tang, Y., Wang, Y., Guo, J., Tu, Z., Han, K., Hu, H., Tao, D.: A survey on transformer compression (2024)

  33. [33]

    Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 10347–10357. PMLR (18–24 Jul 2021)

  34. [34]

    Wang, Z., Luo, H., Wang, P., Ding, F., Wang, F., Li, H.: VTC-LFC: Vision transformer compression with low-frequency components. In: Thirty-Sixth Conference on Neural Information Processing Systems (2022)

  35. [35]

    Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: Neural Information Processing Systems (NeurIPS) (2021)

  36. [36]

    Yang, H., Yin, H., Shen, M., Molchanov, P., Li, H., Kautz, J.: Global vision transformer pruning with hessian-aware saliency. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 18547–18557 (June 2023)

  37. [37]

    Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., Yu, G., Shen, C.: Topformer: Token pyramid transformer for mobile semantic segmentation. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (2022)

  38. [38]

    Zheng, C., Zhang, K., Yang, Z., Tan, W., Xiao, J., Ren, Y., Pu, S., et al.: Savit: Structure-aware vision transformer pruning via collaborative optimization. Advances in Neural Information Processing Systems 35, 9010–9023 (2022)

  39. [39]

    Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)

  40. [40]

    Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)

  41. [41]

    Zhuo, Z., Wang, Y., Ma, J., Wang, Y.: Towards a unified theoretical understanding of non-contrastive learning via rank differential mechanism. In: International Conference on Learning Representations (2023)