ASAP: Attention Sink Anchored Pruning

Donghun Lee; Hanyoung Kim; Jaehyuk Lee; Yanggee Kim

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

Modeling Vision Transformer token flow as a lazy random walk lets pruning anchor to the attention sink and raise inference throughput by up to 48 percent.

2026-05-22 08:07 UTC pith:K5HC64CC

arxiv 2605.22372 v1 pith:K5HC64CC submitted 2026-05-21 cs.LG

classification cs.LG

keywords vision transformers, token pruning, attention sink, random walk, diffusion distance, model efficiency, inference acceleration, visual recognition

verification ladder: T0 review, T1 audit, T2 compute, T3 formal, T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision Transformers slow down at high resolutions because self-attention scales quadratically with the number of tokens. Existing pruning approaches rely on attention scores from a single layer, which tend to keep uninformative background tokens because of the attention sink effect. This paper argues that treating the entire information flow as a lazy random walk reveals the sink as a central point, and that measuring how far each token diffuses from it separates useful tokens from redundant ones. Pruning based on this separation is reported to raise throughput by up to 48% on image, video, and vision-language tasks while accuracy stays the same or improves.

Core claim

The central discovery is that the attention sink can be turned into an asset for pruning by modeling the ViT as a lazy random walk on tokens. The sink accumulates most of the probability mass in the cumulative transition matrix, so the diffusion distance from this sink within that matrix identifies which tokens carry foreground information and which are background redundancy. Radial Diffusion Clustering then groups tokens by this distance, and Transition Weight Pooling merges the redundant ones, all in a single training-free step.
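The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the lazy transition P = (1 - alpha) W + alpha I quoted in the rebuttal on this page, alpha = 0.5 as in the Figure 6 caption, a layer product with later layers multiplied on the left (the attention-rollout convention, assumed here), and a sink chosen as the column holding the most cumulative mass. The toy attention matrices and the helper names (`lazy_transition`, `cumulative_transition`, `column_mass`) are hypothetical, and plain Python lists are used so the snippet runs without dependencies.

```python
def lazy_transition(attn, alpha=0.5):
    """Lazy step: P = (1 - alpha) * W + alpha * I, with W row-stochastic."""
    n = len(attn)
    return [
        [(1 - alpha) * attn[i][j] + (alpha if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two square list-of-lists matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def cumulative_transition(layers, alpha=0.5):
    """Product of per-layer lazy transitions (later layers on the left: an assumption)."""
    cum = lazy_transition(layers[0], alpha)
    for attn in layers[1:]:
        cum = matmul(lazy_transition(attn, alpha), cum)
    return cum


def column_mass(matrix):
    """Total probability mass arriving at each token, summed over source tokens."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) for j in range(n)]


# Hypothetical head-averaged attention for 3 tokens over 2 layers.
# Token 0 is made to attract most of the attention in both layers.
layers = [
    [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.5, 0.1, 0.4]],
    [[0.7, 0.2, 0.1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]],
]

cum = cumulative_transition(layers)
mass = column_mass(cum)
# Assumed sink rule: the token whose column collects the most mass.
sink = max(range(len(mass)), key=mass.__getitem__)
print("sink token:", sink, "column mass:", [round(m, 4) for m in mass])
```

Each lazy transition stays row-stochastic, so the cumulative product does too. The sink then shows up as a column whose total mass is well above the uniform share of one per token.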

What carries the argument

The lazy random walk on the attention graph, where the attention sink acts as the main probability accumulator and diffusion distance to it determines token importance for pruning.

Load-bearing premise

That the attention sink reliably collects the bulk of the probability mass in a lazy random walk model of token interactions, making distance to it a good way to tell important tokens from compressible ones.

What would settle it

A test where tokens with large diffusion distance to the sink are pruned and the resulting model shows a bigger accuracy drop on a standard benchmark than a competing method using direct attention scores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

ASAP recasts the attention sink as a pruning anchor through lazy random walks and diffusion distances, with claimed 48% throughput gains across tasks, but the modeling's edge over simpler attention-based pruning remains the open question.

The main thing to know is that this paper gives a training-free way to prune tokens in Vision Transformers by anchoring on the attention sink. The authors model the information flow as a lazy random walk, build a cumulative transition matrix, measure each token's diffusion distance from the sink, and then apply radial clustering plus weight pooling to compress background tokens in one pass. The abstract claims this beats prior local-metric methods on image, video, and vision-language benchmarks, keeping or improving accuracy while raising throughput by up to 48%.
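The last two steps, grouping tokens by distance and pooling the redundant group, can be illustrated with a small sketch. It is one plausible reading only: the text here does not define Radial Diffusion Clustering or Transition Weight Pooling, so the equal-width distance bands, the caller-supplied pooling weights, and the `merge_bands` parameter (which bands count as compressible) are all assumptions, and every function name is hypothetical.

```python
def radial_bands(dist, k):
    """Assign each token to one of k equal-width distance bands (band 0 is nearest the anchor)."""
    lo, hi = min(dist), max(dist)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(dist)
    return [min(int((d - lo) / width), k - 1) for d in dist]


def pool_tokens(features, weights, members):
    """Merge the listed tokens into one vector by a weight-normalized average."""
    total = sum(weights[i] for i in members)
    dim = len(features[0])
    return [
        sum(weights[i] * features[i][c] for i in members) / total
        for c in range(dim)
    ]


def compress(features, dist, weights, k, merge_bands):
    """Keep tokens outside merge_bands; replace each merged band by one pooled token."""
    bands = radial_bands(dist, k)
    kept = [features[i] for i, b in enumerate(bands) if b not in merge_bands]
    for band in sorted(merge_bands):
        members = [i for i, b in enumerate(bands) if b == band]
        if members:
            kept.append(pool_tokens(features, weights, members))
    return kept


# Hypothetical inputs: 4 tokens with 2-d features, distances to the anchor, pooling weights.
dist = [0.0, 0.1, 0.9, 1.0]
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0], [7.0, 1.0]]
weights = [1.0, 3.0, 1.0, 1.0]

bands = radial_bands(dist, k=2)
tokens = compress(features, dist, weights, k=2, merge_bands={0})
print(bands, tokens)
```

In this sketch the pooled token is appended after the kept tokens, so token order changes. A real implementation would also have to carry positional information and the class token through the merge.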

Referee Report

2 major / 0 minor

Summary. The paper claims to introduce ASAP (Attention Sink Anchored Pruning), a training-free framework for token reduction in Vision Transformers. By modeling ViT information flow as a Lazy Random Walk, it identifies the attention sink as a dominant accumulator of probability mass using diffusion distance in the cumulative transition matrix. Tokens are partitioned via Radial Diffusion Clustering and background redundancy is compressed through Transition Weight Pooling. Extensive experiments on image, video, and vision-language tasks are said to show that ASAP outperforms state-of-the-art methods, with throughput acceleration up to 48% while maintaining or exceeding baseline accuracy.

Significance. If the results hold, this work could advance token pruning techniques by turning the attention sink phenomenon into an advantage rather than a liability. The training-free aspect and application across multiple modalities are notable strengths. However, the soundness of the lazy random walk modeling is central to the claims, and without detailed verification, the significance remains conditional on resolving the identified modeling concerns.

major comments (2)

Abstract: The abstract asserts outperformance and 48% throughput gain but supplies no quantitative tables, error bars, ablation details, or exact definitions of the cumulative transition matrix and radial clustering; central empirical claim cannot be verified from the given text alone.
Lazy Random Walk modeling: The lazy-random-walk modeling is presented as a way to justify using the sink as anchor, yet no equations show whether the diffusion distance is derived independently or simply restates attention scores under a new name; this creates moderate risk that the claimed advantage is definitional rather than substantive.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our paper 'ASAP: Attention Sink Anchored Pruning'. We address each major comment in detail below and outline the revisions we plan to make.

Referee: Abstract: The abstract asserts outperformance and 48% throughput gain but supplies no quantitative tables, error bars, ablation details, or exact definitions of the cumulative transition matrix and radial clustering; central empirical claim cannot be verified from the given text alone.

Authors: We acknowledge that the abstract, due to its brevity, does not include tables or detailed definitions. The full manuscript provides these in Sections 4 and 3, respectively, with quantitative results in Tables 1-4 showing comparisons, including standard deviations where relevant, and ablations in Section 4.3. The cumulative transition matrix is defined in Equation (2) as the product of per-layer transition matrices, and radial diffusion clustering is detailed in Algorithm 1. To improve clarity, we will revise the abstract to briefly reference the key performance metrics and direct readers to the relevant sections for definitions and details. revision: partial
Referee: Lazy Random Walk modeling: The lazy-random-walk modeling is presented as a way to justify using the sink as anchor, yet no equations show whether the diffusion distance is derived independently or simply restates attention scores under a new name; this creates moderate risk that the claimed advantage is definitional rather than substantive.

Authors: The lazy random walk is not merely a renaming of attention scores. We model the information flow with a transition matrix that includes a laziness factor to account for the sink's accumulation of probability mass over multiple layers, as described in Section 3.1. The diffusion distance is then computed using the cumulative transition matrix raised to power t, which integrates information across layers. This is distinct from single-layer attention. We will add explicit equations in the revised manuscript (e.g., expanding Equation (1) to show the lazy transition P = (1 - alpha)W + alpha I, where W is the normalized attention, and the diffusion distance d(i,j) = || (P^t)_i - (P^t)_j ||) to demonstrate the independent derivation and its advantages over local metrics. revision: yes
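The distance named in the second response can be computed directly from its formula, d(i,j) = || (P^t)_i - (P^t)_j ||. The sketch below is an illustration under stated assumptions rather than the paper's code: it takes an unweighted ℓ2 norm between rows of P^t, uses a hypothetical 3-token transition matrix, and fixes the anchor at token 0. The function names are made up for this example.

```python
import math


def matmul(a, b):
    """Plain matrix product of two square list-of-lists matrices."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def matrix_power(p, t):
    """P^t for an integer t >= 1 by repeated multiplication."""
    result = p
    for _ in range(t - 1):
        result = matmul(result, p)
    return result


def diffusion_distance_to(p, anchor, t):
    """d(i, anchor) = || (P^t)_i - (P^t)_anchor ||_2 for every token i."""
    pt = matrix_power(p, t)
    ref = pt[anchor]
    return [math.sqrt(sum((x - y) ** 2 for x, y in zip(row, ref))) for row in pt]


# Hypothetical row-stochastic transition matrix.
# Token 1 routes almost like token 0 (the assumed sink); token 2 mostly keeps its own mass.
P = [
    [0.90, 0.05, 0.05],
    [0.80, 0.15, 0.05],
    [0.20, 0.10, 0.70],
]

dist = diffusion_distance_to(P, anchor=0, t=2)
print([round(d, 4) for d in dist])
```

Two tokens are close under this distance when their whole rows of P^t agree, meaning they distribute probability over all tokens in the same way. That is a different quantity from the single attention score one token assigns to another, which is the distinction the authors' response draws.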

Circularity Check

0 steps flagged

No significant circularity in the ASAP derivation chain

The paper presents the Lazy Random Walk modeling of ViT information flow as an interpretive framework to recast the attention sink as an anchor, followed by explicit construction of a cumulative transition matrix, diffusion distance computation, Radial Diffusion Clustering, and Transition Weight Pooling. These steps are introduced as new operations rather than reductions of existing quantities by definition or self-citation. No equations in the provided text show a fitted parameter or attention score being renamed as a 'prediction' or 'derived distance.' The central claims rest on this modeling choice plus empirical results across tasks, making the derivation self-contained against external benchmarks with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on treating ViT attention as a lazy random walk whose cumulative matrix yields a meaningful diffusion distance to the sink; the full paper likely introduces additional clustering and pooling parameters whose values are not visible in the abstract.

free parameters (1)

number of diffusion steps or cluster radius
Required for radial diffusion clustering and cumulative matrix construction; value not stated in abstract.

axioms (1)

domain assumption: ViT token interactions can be faithfully modeled as a lazy random walk on the attention graph
Invoked to identify the sink as dominant probability-mass accumulator and to define diffusion distance.

pith-pipeline@v0.9.0 · 5688 in / 1477 out tokens · 39516 ms · 2026-05-22T08:07:41.046874+00:00 · methodology

Original abstract

Vision Transformers (ViTs) face severe computational bottlenecks due to the quadratic complexity of self-attention at high resolutions. Existing token reduction methods rely on local metrics - such as single-layer attention scores - that are inherently vulnerable to the attention sink phenomenon, where uninformative tokens are paradoxically preserved over salient foreground objects. We propose ASAP (Attention Sink Anchored Pruning), a training-free framework that recasts this sink as a feature. Modeling ViT information flow as a Lazy Random Walk, ASAP identifies the sink as a dominant accumulator of probability mass. By computing the diffusion distance to the sink within the cumulative transition matrix, ASAP partitions tokens via Radial Diffusion Clustering and compresses background redundancy through Transition Weight Pooling in a single shot. Extensive experiments across image, video, and vision-language tasks demonstrate ASAP outperforms state-of-the-art methods, accelerating throughput by up to 48% while maintaining - or even exceeding - baseline accuracy.

Figures

Figures reproduced from arXiv: 2605.22372 by Donghun Lee, Hanyoung Kim, Jaehyuk Lee, Yanggee Kim.

**Figure 2.** Overview of ASAP. (a) Lazy Random Walk models ViT information flow for stable ...

**Figure 3.** Qualitative results across different backbones (DeiT-Base, ViT-AugReg, LV-ViT-S). Our method consistently preserves foreground objects across diverse architectures and token densities.

**Figure 4.** Necessity of cumulative attention. While our full framework (W/ Markov Chain) success...

**Figure 5.** Qualitative results on Kinetics-400 using CLIP ViT. (Top) Original sequence. (Middle) The ...

**Figure 6.** Accuracy–FLOPs tradeoff on DeiT-Base for varying K and τ. The red circle marks the selected operating point (K=6, τ=7).

Hyperparameter Sensitivity. ASAP introduces two primary hyperparameters: the cluster count K and the sink detection threshold τ. (We fix α = 0.5 following the convention of Attention Rollout [6]; sensitivity analysis for α is provided in Appendix J.)

**Figure 7.** Sink emergence dynamics for DeiT-Base (L=12, N=197) and CLIP ViT-Large (L=24, ...

**Figure 8.** Accuracy–FLOPs trade-off on ViT-AugReg for varying ...

**Figure 9.** Qualitative analysis of hallucination suppression on POPE. Each row shows the input image, ...

**Figure 10.** Qualitative results on DeiT-Base.

**Figure 11.** Qualitative results on ViT-AugReg.

**Figure 12.** Qualitative results on LV-ViT-S.

**Figure 13.** Negative case of random anchor selection. When the anchor is inadvertently assigned ...

**Figure 14.** Positive case of random anchor selection. When the randomly selected anchor fortuitously ...

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 5 internal anchors

[1] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In The Eleventh International Conference on Learning Representations, 2023.
[2] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.
[3] Xinjian Wu, Fanhu Zeng, Xiudong Wang, and Xinghao Chen. Ppt: Token pruning and pooling for efficient vision transformers. arXiv preprint arXiv:2310.01812, 2023.
[4] Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025.
[5] Hongjie Wang, Bhishma Dedhia, and Niraj K Jha. Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16070–16079, 2024.
[6] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4190–4197, 2020.
[7] Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022.
[8] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021.
[9] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024.
[10] Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, and Wenjie Pei. Prune redundancy, preserve essence: Vision token compression in VLMs via synergistic importance-diversity. In The Fourteenth International Conference on Learning Representations, 2026.
[11] Yonatan Dinai, Ishay Goldin, Avraham Raviv, and Niv Zehngut. Rollout-guided token pruning for efficient video understanding. In 2025 IEEE International Conference on Image Processing (ICIP), pages 37–42. IEEE, 2025.
[12] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
[13] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. arXiv preprint arXiv:2410.10781, 2024.
[14] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024.
[15] Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token? arXiv preprint arXiv:2504.02732, 2025.
[16] Valeria Ruscio, Umberto Nanni, and Fabrizio Silvestri. What are you sinking? a geometric approach on attention sink. arXiv preprint arXiv:2508.02546, 2025.
[17] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
[18] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30, 2006.
[19] Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In European conference on computer vision, pages 396–414. Springer, 2022.
[20] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[21] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
[22] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[23] Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18590–18602. Curran Asso...
[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.
[26] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[27] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024.
[28] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
[29] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[30] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[31] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022.
[32] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023.
[33] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[34] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024.
[35] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025.
A Preliminaries on Diffusion Distance. We provide a brief review of diffusion dis...
[36] Full Convergence: $D(x_i, x_s) = 0 \iff x_i$ has been fully absorbed into the sink, yielding identical information routing $P^{(t^*)}_{i,*} = P^{(t^*)}_{s,*}$.
[37] Trajectory Separation: Tokens with large $D(x_i, x_s)$ exhibit distinct information routing patterns (information trajectories) from the sink, guaranteed by their geometric separation in the $P^{(t^*)}$ manifold. Justification: (1) Follows directly from the positive definiteness of the $\ell_2$ norm: $\|v\|_2 = 0 \iff v = 0$. (2) A large separation $D(x_i, s) = \delta > 0$ implies $\|P^{(t^*)}_{i,*} -$ ...

[1] [1]

Token merging: Your vit but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[2] [2]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

work page 2024

[3] [3]

Ppt: Token pruning and pooling for efficient vision transformers.arXiv preprint arXiv:2310.01812, 2023

Xinjian Wu, Fanhu Zeng, Xiudong Wang, and Xinghao Chen. Ppt: Token pruning and pooling for efficient vision transformers.arXiv preprint arXiv:2310.01812, 2023

work page arXiv 2023

[4] [4]

Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms

Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025

work page 2025

[5] [5]

Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers

Hongjie Wang, Bhishma Dedhia, and Niraj K Jha. Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16070–16079, 2024

work page 2024

[6] [6]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 4190–4197, 2020

work page 2020

[7] [7]

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations.arXiv preprint arXiv:2202.07800, 2022

work page Pith review arXiv 2022

[8] [8]

Dynamicvit: Efficient vision transformers with dynamic token sparsification.Advances in neural information processing systems, 34:13937–13949, 2021

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification.Advances in neural information processing systems, 34:13937–13949, 2021

work page 2021

[9] [9]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference.arXiv preprint arXiv:2410.04417, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Prune redundancy, preserve essence: Vision token compression in VLMs via synergistic importance-diversity

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, and Wenjie Pei. Prune redundancy, preserve essence: Vision token compression in VLMs via synergistic importance-diversity. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[11] [11]

Rollout-guided token pruning for efficient video understanding

Yonatan Dinai, Ishay Goldin, Avraham Raviv, and Niv Zehngut. Rollout-guided token pruning for efficient video understanding. In2025 IEEE International Conference on Image Processing (ICIP), pages 37–42. IEEE, 2025. 10

work page 2025

[12] [12]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view.arXiv preprint arXiv:2410.10781, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024

[15] [15]

Barbero et al

Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veliˇckovi´c, and Razvan Pascanu. Why do llms attend to the first token?arXiv preprint arXiv:2504.02732, 2025

work page arXiv 2025

[16] [16]

What are you sinking? A geometric approach on attention sink

Valeria Ruscio, Umberto Nanni, and Fabrizio Silvestri. What are you sinking? A geometric approach on attention sink. arXiv preprint arXiv:2508.02546, 2025.

[17]

Matrix analysis

Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge University Press, 2012.

[18]

Diffusion maps

Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[19]

Adaptive token sampling for efficient vision transformers

Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In European conference on computer vision, pages 396–414. Springer, 2022.

[20]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[21]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

[22]

How to train your ViT? Data, augmentation, and regularization in vision transformers

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.

[23]

All tokens matter: Token labeling for training better vision transformers

Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18590–18602. Curran Asso...

[24]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[25]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.

[26]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[27]

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024.

[28]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.

[29]

GQA: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.

[30]

VizWiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.

[31]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022.

[32]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023.

[33]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[34]

MMBench: Is your multi-modal model an all-around player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024.

[35]

VisionZip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025.

A Preliminaries on Diffusion Distance

We provide a brief review of diffusion dis...


Full Convergence: $D(x_i, x_s) = 0 \iff x_i$ has been fully absorbed into the sink, yielding identical information routing $P^{(t^*)}_{i,*} = P^{(t^*)}_{s,*}$.


Trajectory Separation: Tokens with large $D(x_i, x_s)$ exhibit information trajectories distinct from the sink's, guaranteed by their geometric separation in the $P^{(t^*)}$ manifold.

Justification: (1) follows directly from the positive definiteness of the $\ell_2$ norm: $\|v\|_2 = 0 \iff v = 0$. (2) A large separation $D(x_i, x_s) = \delta > 0$ implies $\|P^{(t^*)}_{i,*} -$...

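The Full Convergence and Trajectory Separation properties can be checked numerically on a toy example. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the lazy walk has the form $P = \alpha I + (1 - \alpha) A$ for a row-stochastic attention matrix $A$, and that $D(x_i, x_s)$ is the unweighted $\ell_2$ distance between rows $i$ and $s$ of $P^{(t^*)}$. The names `lazy_walk` and `sink_distances`, the laziness `alpha`, the step count `t_star`, and the 4-token matrix are all hypothetical.

```python
import numpy as np


def lazy_walk(attn, alpha=0.5):
    """Lazy walk (assumed form): stay with prob. alpha, else follow attn."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum(axis=1, keepdims=True)  # make each row sum to 1
    return alpha * np.eye(attn.shape[0]) + (1.0 - alpha) * attn


def sink_distances(attn, sink, t_star=4, alpha=0.5):
    """l2 distance of every row of P^(t*) from the sink row (assumed D)."""
    p_t = np.linalg.matrix_power(lazy_walk(attn, alpha), t_star)
    return np.linalg.norm(p_t - p_t[sink], axis=1)


# Hypothetical attention: token 0 is the sink, token 1 attends exactly like
# the sink, and tokens 2 and 3 mostly attend to each other.
attn = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
])
d = sink_distances(attn, sink=0)
print(d)
```

In this toy setup, token 1 shares the sink's outgoing attention, so its distance shrinks geometrically with the number of steps (it equals `sqrt(2) * alpha**t_star`), approaching the full-convergence case, while tokens 2 and 3 remain clearly separated from the sink.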