StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

An T. Le; Daniel Palenicek; Daniel Sonntag; Duong Nguyen; Duy M. H. Nguyen; James Zou; Jan Peters; Khoa Doan; Mai T. N. Truong; Mathias Niepert

arxiv: 2603.07307 · v2 · pith:TLXZ5ZV4new · submitted 2026-03-07 · 💻 cs.CV · cs.LG

Duy M. H. Nguyen, Tuan A. Tran, Duong Nguyen, Siwei Xie, Trung Q. Nguyen, Mai T. N. Truong, Daniel Palenicek, An T. Le, Michael Barz, TrungTin Nguyen, Tuan Dam, Ngan Le, Minh Vu, Khoa Doan, Vien Ngo, Pengtao Xie, James Zou, Daniel Sonntag, Jan Peters, Mathias Niepert

keywords: merging, structsam, token, anything, boundary, encoder, medical, prompt

Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening to protect boundary and prompt regions, and merges tokens within flat areas toward low-energy destinations with explicit token recovery. We further provide a spectral graph coarsening view showing that score-guided merging yields bounded Laplacian spectral distortion compared to random or window-restricted baselines. Across eight natural and medical benchmarks, StructSAM reduces encoder FLOPs by 25-30% (up to 40%+ with prompt-aware merging) with minor drops in mIoU/Dice, consistently outperforming ToMe, PiToMe, ToMeSD, VidToMe, and ALGM at the same compute.
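
The merge-unmerge idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a simplified toy under stated assumptions. The function names (`token_energy`, `merge_flat_cells`, `unmerge`), the fixed cell size, the threshold `flat_thresh`, and the boolean `protect` mask standing in for prompt regions are all hypothetical. Here the energy is the magnitude of finite-difference feature gradients, a grid cell counts as flat when its maximum energy falls below the threshold, and each flat cell collapses to one token (the cell mean, recorded at the cell's lowest-energy position).

```python
import numpy as np

def token_energy(feats):
    """Per-token energy from first-order finite differences.

    feats: array of shape (H, W, C). Returns an (H, W) array.
    """
    gy, gx = np.gradient(feats, axis=(0, 1))
    return np.sqrt((gx ** 2 + gy ** 2).sum(axis=-1))

def merge_flat_cells(feats, cell=2, flat_thresh=0.1, protect=None):
    """Collapse each flat, unprotected grid cell into one token.

    Assumptions (hypothetical, for illustration only): non-overlapping
    cell x cell grid, a hard energy threshold, mean pooling as the merge rule.
    Returns (merged tokens, assignment map, destination coordinates).
    """
    H, W, C = feats.shape
    if H % cell or W % cell:
        raise ValueError("H and W must be divisible by cell")
    energy = token_energy(feats)
    if protect is None:
        protect = np.zeros((H, W), dtype=bool)

    merged, dests = [], []
    assign = np.empty((H, W), dtype=np.int64)  # original token -> merged index
    for i in range(0, H, cell):
        for j in range(0, W, cell):
            e = energy[i:i + cell, j:j + cell]
            block = feats[i:i + cell, j:j + cell].reshape(-1, C)
            flat = e.max() < flat_thresh and not protect[i:i + cell, j:j + cell].any()
            if flat:
                # Destination is the lowest-energy token of the cell.
                di, dj = divmod(int(e.argmin()), cell)
                assign[i:i + cell, j:j + cell] = len(merged)
                merged.append(block.mean(axis=0))
                dests.append((i + di, j + dj))
            else:
                # Boundary or protected cell: keep every token.
                for k, tok in enumerate(block):
                    di, dj = divmod(k, cell)
                    assign[i + di, j + dj] = len(merged)
                    merged.append(tok)
                    dests.append((i + di, j + dj))
    return np.stack(merged), assign, dests

def unmerge(merged, assign):
    """Recover a full-resolution (H, W, C) map by copying merged tokens back."""
    return merged[assign]
```

In this toy version, token recovery is a plain gather through the assignment map, so the output keeps the original spatial resolution regardless of how many cells were merged. A real encoder would run attention on the reduced token set between the merge and unmerge steps, which this sketch omits.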

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SparseSAM: Structured Sparsification of Activations in Segment Anything Models
cs.CV 2026-05 unverdicted novelty 6.0

SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.