pith. sign in

arxiv: 2605.17633 · v1 · pith:7Z2RJ5YMnew · submitted 2026-05-17 · 💻 cs.CV · cs.AI

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

Pith reviewed 2026-05-20 13:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords structured sparsificationsegment anything modeltraining-freesparse attentionactivation compressioninference accelerationvision transformerresidual routing
0
0 comments X

The pith

SparseSAM applies structured sparsification to SAM's activations for 2x faster inference and 2.8x memory savings with under 0.03 mIoU loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SparseSAM as a training-free framework that sparsifies activations inside the Segment Anything Model's ViT encoder. It does this by jointly handling attention and MLP layers through fixed reordering and selective routing that keeps token identities intact. The goal is to cut the dominant latency and memory costs of SAM without retraining or the quality collapse seen in prior token-merging or sparse-attention techniques. A sympathetic reader would care because SAM is widely used for open-vocabulary segmentation, yet its encoder makes real-time or edge deployment difficult.

Core claim

SparseSAM introduces Stripe-Sort Attention that converts dense attention into static sparse patterns via a deterministic Z-order permutation and pairs it with a Residual-Consistency MLP that routes only informative tokens through the MLP while sending the rest down the residual path, yielding 2x faster inference and 2.8x memory reduction across four benchmarks while limiting mIoU loss to 0.004 at 0.4 density and 0.021 at 0.3 density.

What carries the argument

Stripe-Sort Attention with Z-order permutation for static hardware-friendly sparse attention patterns, together with Residual-Consistency MLP for selective token routing that preserves information flow without dynamic overhead.

If this is right

  • At 0.4 density the method loses only 0.004 mIoU on the benchmarks.
  • At 0.3 density the accuracy loss is 0.021, which is 2.10 times smaller than the loss from token merging.
  • Inference speed doubles while memory footprint shrinks by a factor of 2.8.
  • The approach works without retraining or fine-tuning on existing SAM checkpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The static sparse patterns created by Z-order permutation could map directly onto sparse matrix accelerators for additional speedups.
  • Because the method is training-free it can be inserted into any deployed SAM pipeline as a drop-in replacement.
  • Similar structured sparsification might extend to other ViT-based segmentation or detection models that share the same encoder bottleneck.
  • Combining SparseSAM with quantization could produce even larger efficiency gains at the cost of one extra hyper-parameter.

Load-bearing premise

The fixed Z-order reordering and residual bypass for some tokens are assumed to retain enough information for accurate open-vocabulary segmentation even at low densities and without any model retraining.

What would settle it

Applying SparseSAM at 0.3 density on the four segmentation benchmarks and measuring an mIoU drop larger than 0.021 relative to the dense baseline would falsify the minimal-accuracy-loss claim.

Figures

Figures reproduced from arXiv: 2605.17633 by Chi H. Nguyen, Duy M. H. Nguyen, Fan Lai, Hoai-Chau Tran, Khoa D. Doan, Mathias Niepert.

Figure 1
Figure 1. Figure 1: SparseSAM enables 2.02× faster SAM inference while preserving quality. These limitations point to a key gap: efficient, training-free methods that jointly reduce computation in both attention and MLP layers, while preserving token-level fidelity required for segmentation. We introduce SparseSAM, a training-free approach designed to leverage the inherent sparsity exposed by the activations in both the atten… view at source ↗
Figure 2
Figure 2. Figure 2: SAM latency profiling. SAM Architecture. SAM adopts a hierarchical Vision Transformer (ViT) encoder com￾posed of both local and global attention blocks. Local blocks partition the image into non￾overlapping windows and ap￾ply self-attention independently within each window, while global blocks perform dense attention over the entire token sequence to capture long-range de￾pendencies. 2 [PITH_FULL_IMAGE:fi… view at source ↗
Figure 3
Figure 3. Figure 3: Permuted attention visualiza￾tion. Comparison between the original SAM attention pattern (left) and our Z-order per￾muted pattern (right). The permutation in￾duces a banded diagonal structure, allowing for significant hardware acceleration with minimal loss in spatial information. Since SAM is exceptionally sensitive to overheads, we re￾quire an approach that minimizes per-layer administrative costs while … view at source ↗
Figure 4
Figure 4. Figure 4: K-means clustering replacement. Layer 1 Layer 6 Layer 16 Layer 22 ρ 0.64 0.62 0.75 0.78 [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: We employ a deterministic Z-order (Morton) traversal to linearize 2D spatial tokens into a 1D sequence. This traversal naturally preserves spatial locality by grouping neighboring pixels into localized "Z-groups." Such grouping enables the efficient computation of gradient-based importance scores within local windows (Eq. 1), which are then used to rank and route tokens. The resulting permuted sequence tra… view at source ↗
Figure 6
Figure 6. Figure 6: Effect of Stripe-Sort Atten￾tion with different sparsity ratios We measure local gradient magnitude using the Sobel opera￾tor [9], a classical edge-detection filter that approximates the spatial derivatives of an image through a pair of fixed 3 × 3 convolutional kernels. Specifically, the gradient magnitude at each spatial position is M[i, j] = q (Sx ∗ X) 2[i, j] + (Sy ∗ X) 2[i, j], (1) where Sx and Sy den… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of the residual-consistency MLP. (a) The deterministic token order [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Robustness Under Extreme Compression. SparseSAM sets a new standard for efficient segmentation, consistently outperforming existing dynamic and merging-based approaches across density levels. Notably, at high compression rates (density < 0.4), where competing methods suffer catastrophic quality degradation, SparseSAM preserves near-baseline fidelity. This stability allows for a 2.2× end-to-end acceleration… view at source ↗
Figure 9
Figure 9. Figure 9: Performance breakdown with different permutation orders. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per-token MLP update norm ||∆MLP||2 at blocks 0/7/15/23 of PE-Core-L14-336 (top) and SAM-HQ ViT-L (bottom) on the same input. Reflecting their different pretraining inductive biases, the two backbones distribute MLP work in opposite ways. The first block looks similar across models — both focus updates on the butterfly — but in later layers, PE updates tokens uniformly across the image, while SAM stays sp… view at source ↗
Figure 11
Figure 11. Figure 11: Results on ImageNet[6] 17 [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: More output visualization 18 [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
read the original abstract

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the token length to process, yet introduce non-trivial runtime overhead and encounter catastrophic quality drop under high compression. Other methods applying Sparse Attention focus on attention alone, leaving the MLP fully dense and capping achievable speedup. We propose SparseSAM, a (i) training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces a (iii) Residual-Consistency MLP that routes only informative tokens through the MLP while propagating remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a 0.4 density and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token merging advances, while achieving 2x faster inference and 2.8x memory reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces SparseSAM, a training-free structured sparsification framework for the Segment Anything Model (SAM) ViT encoder. It proposes Stripe-Sort Attention, which applies a deterministic Z-order permutation to convert dense attention into static hardware-friendly sparse patterns, and Residual-Consistency MLP, which routes only informative tokens through the MLP while propagating others via the residual pathway. The paper reports that across four segmentation benchmarks, SparseSAM incurs only 0.004 mIoU loss at 0.4 density and 0.021 mIoU loss at 0.3 density (a 2.10x reduction in accuracy loss versus token merging), while delivering 2x faster inference and 2.8x memory reduction.

Significance. If the empirical results and information-preservation claims hold, the work would offer a practical, retraining-free route to accelerate and compress SAM for open-vocabulary segmentation, addressing a key bottleneck in deploying large ViT-based models under latency and memory constraints.

major comments (3)
  1. [Abstract] Abstract: the headline claim that the modules 'preserve token identity' and maintain 'exact feature statistics the SAM decoder was trained on' is load-bearing for all reported mIoU figures, yet the text supplies neither a derivation nor an ablation demonstrating that the fixed Z-order permutation respects the original ViT positional encodings or that the routing criterion discards only non-critical tokens under arbitrary prompts.
  2. [Method (Stripe-Sort Attention)] Method description of Stripe-Sort Attention: the assertion that the deterministic Z-order permutation 'eliminates dynamic masking overhead' while preserving spatial locality is central to the speedup claim, but no analysis is given of how the reordering interacts with the frozen decoder's learned positional biases.
  3. [Experiments] Experimental results: the reported mIoU deltas (0.004 at density 0.4, 0.021 at 0.3) and the 2.10x accuracy-loss reduction versus baselines are presented without error bars, run counts, or statistical tests, which prevents verification of the cross-benchmark stability asserted in the abstract.
minor comments (1)
  1. [Abstract] The term 'density' is introduced without an immediate parenthetical definition (e.g., retained token fraction); adding this would aid readers unfamiliar with the compression literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and indicate planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that the modules 'preserve token identity' and maintain 'exact feature statistics the SAM decoder was trained on' is load-bearing for all reported mIoU figures, yet the text supplies neither a derivation nor an ablation demonstrating that the fixed Z-order permutation respects the original ViT positional encodings or that the routing criterion discards only non-critical tokens under arbitrary prompts.

    Authors: We agree that explicit justification is warranted. In the revised manuscript we will add a derivation in Section 3.1 showing that the deterministic Z-order permutation is a fixed invertible reordering within stripes that preserves relative spatial locality and therefore aligns with the frozen decoder's positional encodings. We will also include a new ablation (Appendix C) evaluating the Residual-Consistency routing criterion on a diverse set of prompts drawn from all four benchmarks, confirming that discarded tokens contribute negligibly to final mIoU. These additions directly support the reported accuracy figures. revision: yes

  2. Referee: [Method (Stripe-Sort Attention)] Method description of Stripe-Sort Attention: the assertion that the deterministic Z-order permutation 'eliminates dynamic masking overhead' while preserving spatial locality is central to the speedup claim, but no analysis is given of how the reordering interacts with the frozen decoder's learned positional biases.

    Authors: We will expand the Stripe-Sort Attention subsection to provide the requested analysis. Because the permutation is static and identical for every input, it can be exactly inverted after the sparse attention computation; the residual pathway then restores original token identities before the decoder. We will include a short proof that this composition leaves the positional bias terms unchanged relative to the original dense forward pass, together with a timing breakdown confirming the elimination of dynamic masking cost. The revised text will make the interaction explicit. revision: yes

  3. Referee: [Experiments] Experimental results: the reported mIoU deltas (0.004 at density 0.4, 0.021 at 0.3) and the 2.10x accuracy-loss reduction versus baselines are presented without error bars, run counts, or statistical tests, which prevents verification of the cross-benchmark stability asserted in the abstract.

    Authors: SparseSAM is fully deterministic and training-free, so each benchmark result has zero run-to-run variance. We will revise the Experiments section to state this explicitly, report a run count of one, and tabulate per-benchmark mIoU values so readers can directly inspect cross-benchmark consistency. Formal error bars and statistical tests are not meaningful under determinism; we will instead emphasize reproducibility and the 2.10x reduction relative to token-merging baselines on the same fixed decoder. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical mIoU and speedup results are direct measurements, not reductions to fitted inputs or self-citations

full rationale

The paper proposes training-free modules (Stripe-Sort Attention via fixed Z-order permutation and Residual-Consistency MLP with selective routing plus residual bypass) and evaluates them empirically on four segmentation benchmarks. Reported figures (0.004/0.021 mIoU loss at 0.4/0.3 density, 2x inference speedup, 2.8x memory reduction) are presented as observed outcomes versus baselines. No equations, predictions, or load-bearing claims reduce these results to parameters fitted from the same data or to self-citations whose content is unverified. The derivation chain consists of method definitions followed by independent experimental validation; the central performance claims remain falsifiable outside any internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that structured sparsity patterns derived from Z-order permutation and selective residual routing maintain segmentation quality; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Deterministic Z-order permutation transforms dense attention into static hardware-friendly sparse patterns without dynamic overhead
    Invoked to justify Stripe-Sort Attention design.
  • domain assumption Routing only informative tokens through MLP while propagating others via residual pathway preserves model output quality
    Central to Residual-Consistency MLP.

pith-pipeline@v0.9.0 · 5761 in / 1236 out tokens · 42291 ms · 2026-05-20T13:40:21.660853+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 11 internal anchors

  1. [1]

    Token Merging: Your ViT But Faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

  2. [2]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network.arXiv preprint arXiv:2504.13181, 2025

  3. [3]

    Slimsam: 0.1% data makes segment anything slim.Advances in Neural Information Processing Systems, 37:39434–39461, 2024

    Zigeng Chen, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Slimsam: 0.1% data makes segment anything slim.Advances in Neural Information Processing Systems, 37:39434–39461, 2024

  4. [5]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  5. [6]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  6. [7]

    Space-filling curves for modeling spatial context in transformer- based whole slide image classification

    Cihan Erkan and Selim Aksoy. Space-filling curves for modeling spatial context in transformer- based whole slide image classification. InMedical Imaging 2023: Digital and Computational Pathology, volume 12471, pages 416–423. SPIE, 2023

  7. [8]

    YOLOX: Exceeding YOLO Series in 2021

    Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021.arXiv preprint arXiv:2107.08430, 2021

  8. [9]

    Pearson education india, 2009

    Rafael C Gonzalez.Digital image processing. Pearson education india, 2009

  9. [10]

    Detrs with hybrid matching.arXiv preprint arXiv:2207.13080, 2022

    Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. Detrs with hybrid matching.arXiv preprint arXiv:2207.13080, 2022

  10. [11]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.Advances in Neural Information Processing Systems, 37:52481–52515, 2024

  11. [12]

    Hq-sam: Im- proving the segmentation anything model with high-quality data.arXiv preprint arXiv:2306.01567, 2023

    Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality.arXiv preprint arXiv:2306.01567, 2023

  12. [13]

    Marlin: FP16xINT4 LLM inference kernel that can achieve near-ideal 4x speedups up to medium batchsizes.https://github.com/IST-DASLab/marlin, 2023

    Jiamei Kim, Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Marlin: FP16xINT4 LLM inference kernel that can achieve near-ideal 4x speedups up to medium batchsizes.https://github.com/IST-DASLab/marlin, 2023. Accessed: 2026-05-07. 10

  13. [14]

    Pisa: Piecewise sparse attention is wiser for efficient diffusion transformers.arXiv preprint arXiv:2602.01077, 2026

    Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, and Zeke Xie. Pisa: Piecewise sparse attention is wiser for efficient diffusion transformers.arXiv preprint arXiv:2602.01077, 2026

  14. [15]

    Expediting large-scale vision transformer for dense prediction without fine-tuning

    Weicong Liang, Yuhui Yuan, Henghui Ding, Xiao Luo, Weihong Lin, Ding Jia, Zheng Zhang, Chao Zhang, and Han Hu. Expediting large-scale vision transformer for dense prediction without fine-tuning. InAdvances in Neural Information Processing Systems, volume 35, pages 35462–35477, 2022

  15. [16]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context.arXiv preprint arXiv:1405.0312, 2014

  16. [17]

    Group fisher pruning for practical network compression

    Liyang Liu, Shilong Zhang, Zhanghui Kuang, Aojun Zhou, Jing-Hao Xue, Xinjiang Wang, Yimin Chen, Wenming Yang, Qingmin Liao, and Wayne Zhang. Group fisher pruning for practical network compression. InInternational Conference on Machine Learning, pages 7021–7032. PMLR, 2021

  17. [18]

    Structured knowledge distillation for semantic segmentation

    Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019

  18. [19]

    Learning efficient convolutional networks through network slimming

    Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. InProceedings of the IEEE International Conference on Computer Vision, pages 2736–2744, 2017

  19. [20]

    Ptq4sam: Post-training quantization for segment anything

    Chengtao Lv, Hong Chen, Jinyang Guo, Yifu Ding, and Xianglong Liu. Ptq4sam: Post-training quantization for segment anything. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 15941–15951, 2024

  20. [21]

    International Business Machines Company, 1966

    Guy M Morton.A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966

  21. [22]

    Structsam: Structure-and spectrum-preserving token merging for segment anything models.arXiv preprint arXiv:2603.07307, 2026

    Duy MH Nguyen, Tuan A Tran, Duong Nguyen, Siwei Xie, Trung Q Nguyen, Mai TN Truong, Daniel Palenicek, An T Le, Michael Barz, TrungTin Nguyen, et al. Structsam: Structure-and spectrum-preserving token merging for segment anything models.arXiv preprint arXiv:2603.07307, 2026

  22. [23]

    Mix-qsam: Mixed-precision quantization of the segment anything model

    Navin Ranjan and Andreas Savakis. Mix-qsam: Mixed-precision quantization of the segment anything model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3280–3290, 2025

  23. [24]

    JZ-Tree: GPU friendly neighbour search and friends-of-friends with dual tree walks in JAX plus CUDA

    Jens Stücker, Oliver Hahn, Lukas Winkler, Adrian Gutierrez Adame, and Thomas Flöss. Jz-tree: Gpu friendly neighbour search and friends-of-friends with dual tree walks in jax plus cuda. arXiv preprint arXiv:2604.05885, 2026

  24. [25]

    Accelerating transformers with spectrum-preserving token merging.Advances in Neural Information Processing Systems, 37:30772–30810, 2024

    Chau Tran, Duy MH Nguyen, Manh-Duy Nguyen, TrungTin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y Zou, Binh Nguyen, and Mathias Niepert. Accelerating transformers with spectrum-preserving token merging.Advances in Neural Information Processing Systems, 37:30772–30810, 2024

  25. [26]

    How many tokens do 3d point cloud transformer architectures really need?arXiv preprint arXiv:2511.05449, 2025

    Tuan Anh Tran, Duy MH Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D Doan, Roger Watten- hofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, and Paul Swoboda. How many tokens do 3d point cloud transformer architectures really need?arXiv preprint arXiv:2511.05449, 2025

  26. [27]

    Q-vlm: Post-training quantization for large vision-language models.Advances in Neural Information Processing Systems, 37:114553–114573, 2024

    Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu. Q-vlm: Post-training quantization for large vision-language models.Advances in Neural Information Processing Systems, 37:114553–114573, 2024

  27. [28]

    Sparse videogen: Acceler- ating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776,

    Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025. 11

  28. [29]

    DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads.arXiv preprint arXiv:2410.10819, 2024

  29. [30]

    Efficientsam: Leveraged masked image pretraining for efficient segment anything.arXiv preprint arXiv:2312.00863, 2023

    Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything.arXiv preprint arXiv:2312.00863, 2023

  30. [31]

    Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

    Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025

  31. [32]

    FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005, 2025

  32. [33]

    Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

    Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: Towards lightweight sam for mobile applications. arXiv preprint arXiv:2306.14289, 2023

  33. [34]

    DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

    Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung- Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022

  34. [35]

    Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

    Jiarui Zhang, Chao Xiang, Haofeng Huang, Jingyan Wei, Haoyuan Xi, Jun Zhu, and Jianfei Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

  35. [36]

    Ahcptq: Accurate and hardware-compatible post-training quantization for segment anything model

    Wenlun Zhang, Yunshan Zhong, Shimpei Ando, and Kentaro Yoshioka. Ahcptq: Accurate and hardware-compatible post-training quantization for segment anything model. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22383–22392, 2025

  36. [37]

    arXiv preprint arXiv:2306.12156 (2023) 31

    Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment anything.arXiv preprint arXiv:2306.12156, 2023

  37. [38]

    Edgesam: Prompt-driven edge-aware segmentation with segment anything.arXiv preprint, 2023

    Chong Zhou, Xiangtai Li, Chen Change Loy, and Bo Dai. Edgesam: Prompt-driven edge-aware segmentation with segment anything.arXiv preprint, 2023. 12 A Appendix A.1 Attention Kernel Pseudo Code Algorithm 1A-Shape Windowed FlashAttention-2 with Decomposed 2D Rel-Pos Bias // Attention input with decomposed 2D biasB[q, k]=B H[q, krow]+BW [q, kcol]on aw×wgrid I...