OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

Bofan Lyu; Geng Li; Gen Li; Guohao Chen; Jianfei Yang; Kuangji Zuo; Shilin Shan; Ting Chen; Tuo An

arxiv: 2605.29657 · v1 · pith:JZNRZBINnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

Geng Li , Guohao Chen , Ting Chen , Shilin Shan , Kuangji Zuo , Bofan Lyu , Tuo An , Gen Li

show 1 more author

Jianfei Yang

This is my paper

Pith reviewed 2026-06-29 08:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords token pruningvision-language modelsefficient inferenceattention mechanismsregister tokensadaptive pruningtraining-free methods

0 comments

The pith

OccamToken replaces absolute visual token ranking with register-anchored relative testing to prune tokens down to 1.4% retention while keeping over 93% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models incur high prefill costs from long visual token sequences. Absolute ranking methods that pick a fixed top-K set prove brittle because attention sinks skew scores and because image redundancy plus query dependence make any fixed budget unreliable. OccamToken instead anchors pruning on register tokens that absorb low-information attention patterns and uses them as a reference to test whether each visual token adds genuine evidence. Dynamic thresholds derived from register attention then drive both image-adaptive redundancy removal and query-adaptive relevance removal. The approach runs without training and improves the accuracy-efficiency curve on LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL.

Core claim

The paper claims that register tokens naturally absorb low-information attention patterns and therefore supply a stable reference for identifying genuinely informative visual evidence. Replacing absolute importance ranking with register-anchored relative evidence testing yields both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention, delivering consistent gains in the accuracy-efficiency trade-off without any additional training.

What carries the argument

Register-anchored relative evidence testing, which compares each visual token's attention contribution against a register-based reference to decide retention via dynamic thresholds.

If this is right

Accuracy-efficiency trade-off improves across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL without training.
Stable compression remains possible even at the 1.4% retention regime.
Both image-adaptive redundancy pruning and query-adaptive relevance pruning are achieved through the same register-derived thresholds.
The method eliminates the need for per-input fixed token budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same register-reference idea could be tested in other attention-heavy multimodal architectures where certain tokens act as sinks.
Memory footprint during inference would drop roughly in proportion to the token reduction if the accuracy claim holds.
Future work could examine whether the register tokens themselves can be further compressed once they have served as the reference.

Load-bearing premise

Register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence.

What would settle it

On LLaVA-NeXT, accuracy at a 40-token budget with register-anchored relative testing falls below the accuracy obtained by absolute top-K ranking at the same budget.

read the original abstract

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OccamToken shows practical training-free pruning that hits high compression with retained accuracy, but the register reference needs an ablation to separate it from dynamic thresholding.

read the letter

The main point is that OccamToken replaces absolute token ranking with register-anchored relative testing and reports strong compression numbers without any training. On LLaVA-NeXT it drops 2880 visual tokens to roughly 40 while keeping over 93 percent accuracy, and it shows similar patterns on LLaVA-v1.5 and Qwen3-VL.

What is new is the framing that register tokens absorb low-information attention and therefore serve as a stable baseline for deciding whether a visual token adds real evidence. The method then derives image-adaptive and query-adaptive thresholds from that reference. This is positioned against the brittleness of fixed top-K selection under attention sinks and varying queries.

The paper does well at stating a clear problem with existing absolute-ranking approaches and at giving concrete retention numbers at the extreme 1.4 percent regime. Those numbers, if they hold under the full experimental protocol, would be useful for anyone trying to cut prefill cost on long visual sequences.

The soft spot is the missing isolation for the register claim. The abstract presents registers as naturally low-information sinks, yet there is no sign of an ablation that keeps the dynamic-threshold logic fixed and swaps the register reference for a simple mean or non-register baseline. Without that, the gains could come from adaptivity alone rather than the specific register property. The abstract also omits dataset details, baselines, and variance, so the full paper must supply those for the accuracy claims to be evaluable.

This is for practitioners who need to reduce VLM inference cost on visual inputs. A reader working on deployment or token compression would get value from trying the method even if they later add their own controls.

It deserves peer review because the empirical target is concrete and the core idea is simple enough to test.

Referee Report

2 major / 2 minor

Summary. The paper proposes OccamToken, a training-free token pruning framework for VLMs that replaces absolute importance ranking with register-anchored relative evidence testing. It claims that register tokens serve as a stable low-information reference, enabling image-adaptive redundancy pruning and query-adaptive relevance pruning via dynamic thresholds; experiments on LLaVA-NeXT, LLaVA-v1.5 and Qwen3-VL reportedly reduce 2880 visual tokens to ~40 while retaining >93% accuracy even at 1.4% retention.

Significance. If the central claims hold after proper isolation of the register effect, the method would provide a practical, training-free route to extreme visual-token compression that adapts to both image redundancy and query dependence, addressing known brittleness of fixed top-K pruning in the presence of attention sinks.

major comments (2)

[Abstract] Abstract (key insight paragraph): the assertion that register tokens 'naturally absorb low-information attention patterns' and thereby supply a stable reference is load-bearing for the entire framework, yet the manuscript supplies no ablation that holds all other components fixed while replacing the register reference with a non-register baseline (e.g., mean visual attention or a learned sink token). Without this isolation, observed gains at 1.4% retention cannot be attributed to the register property rather than dynamic thresholding alone.
[Method] Method section (register-anchored relative testing): the derivation of query-adaptive thresholds from register attention is described only at a high level; the paper must supply the precise formula (including any scaling or normalization constants) and demonstrate that the resulting thresholds remain stable across the reported retention regimes.

minor comments (2)

[Abstract] Abstract: quantitative claims (2,880 tokens → ~40, >93% accuracy) are presented without reference to the exact evaluation protocol, datasets, or number of runs; these details belong in the abstract or a dedicated experimental-setup paragraph.
Throughout: notation for register tokens versus visual tokens should be introduced once with consistent symbols rather than relying on prose descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our core claims. We address each major point below and will revise the manuscript accordingly to strengthen the evidence and exposition.

read point-by-point responses

Referee: [Abstract] Abstract (key insight paragraph): the assertion that register tokens 'naturally absorb low-information attention patterns' and thereby supply a stable reference is load-bearing for the entire framework, yet the manuscript supplies no ablation that holds all other components fixed while replacing the register reference with a non-register baseline (e.g., mean visual attention or a learned sink token). Without this isolation, observed gains at 1.4% retention cannot be attributed to the register property rather than dynamic thresholding alone.

Authors: We agree that an explicit ablation isolating the register reference is necessary to attribute performance gains specifically to the register property rather than dynamic thresholding in isolation. In the revised manuscript we will add this ablation (holding all other components fixed) comparing register-anchored relative testing against non-register baselines such as mean visual attention and a fixed sink token, across the same retention regimes and models. revision: yes
Referee: [Method] Method section (register-anchored relative testing): the derivation of query-adaptive thresholds from register attention is described only at a high level; the paper must supply the precise formula (including any scaling or normalization constants) and demonstrate that the resulting thresholds remain stable across the reported retention regimes.

Authors: We will expand the Method section to include the exact mathematical formulation of the query-adaptive thresholds (including all scaling and normalization constants) derived from register attention. We will also add a dedicated analysis subsection demonstrating threshold stability across the full range of reported retention ratios (50% down to 1.4%) on the evaluated models. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained

full rationale

The provided abstract and description present OccamToken as a training-free method that replaces absolute ranking with register-anchored relative testing, justified by a stated key insight about register tokens. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are shown that reduce any central claim to its own inputs by construction. The performance numbers (e.g., 2,880 to ~40 tokens at 93% accuracy) are presented as empirical results rather than definitional equivalences. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the core assumption stated in the text; no free parameters or invented entities are explicitly introduced.

axioms (1)

domain assumption Register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence.
This premise is presented as the key insight enabling the relative evidence testing and dynamic thresholds.

pith-pipeline@v0.9.1-grok · 5810 in / 1339 out tokens · 32952 ms · 2026-06-29T08:44:09.345972+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 6 canonical work pages · 4 internal anchors

[1]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

work page arXiv
[4]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023a. Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision lang...

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023b. Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual ...

2023
[6]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

12 Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

important tokens

Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for “important tokens” in multimodal language models: Duplication matters more. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9972–9991,

2025
[8]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis

Accessed: 2025-04-26. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations,

2025
[9]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22128–22136, 2025a. Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for la...

work page arXiv
[10]

After this point, applying pruning at later layers brings only marginal accuracy gains, but reduces the efficiency benefit because more decoder layers have already processed the full visual sequence. Therefore, we 14 choose layer 11 as the default Stage II pruning layer for LLaVA-series models, which provides a practical balance between reliable query-awa...

2025
[11]

unanswerable

POPELi et al. (2023b) evaluates object hallucination by asking binary questions about whether a specific object exists in the image. It constructs questions under three sampling strategies: random, popular, and adversarial. This benchmark is particularly relevant for visual token pruning, since overly aggressive pruning may remove small or less salient ob...

2022
[12]

to examine its applicability beyond image-only inputs. As shown in Table 8, OccamToken preserves 97.7% of the full-token Video-LLaVA performance, outperforming SparseVLM and VisionZip by 11.2 and 4.5 relative-accuracy points, respectively. Notably, OccamToken remains close to the full-token baseline on MSRVTT and slightly improves the score on ActivityNet...

2012

[1] [1]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

work page arXiv

[4] [4]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023a. Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision lang...

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023b. Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual ...

2023

[6] [6]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

12 Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

important tokens

Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for “important tokens” in multimodal language models: Duplication matters more. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9972–9991,

2025

[8] [8]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis

Accessed: 2025-04-26. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations,

2025

[9] [9]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22128–22136, 2025a. Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for la...

work page arXiv

[10] [10]

After this point, applying pruning at later layers brings only marginal accuracy gains, but reduces the efficiency benefit because more decoder layers have already processed the full visual sequence. Therefore, we 14 choose layer 11 as the default Stage II pruning layer for LLaVA-series models, which provides a practical balance between reliable query-awa...

2025

[11] [11]

unanswerable

POPELi et al. (2023b) evaluates object hallucination by asking binary questions about whether a specific object exists in the image. It constructs questions under three sampling strategies: random, popular, and adversarial. This benchmark is particularly relevant for visual token pruning, since overly aggressive pruning may remove small or less salient ob...

2022

[12] [12]

to examine its applicability beyond image-only inputs. As shown in Table 8, OccamToken preserves 97.7% of the full-token Video-LLaVA performance, outperforming SparseVLM and VisionZip by 11.2 and 4.5 relative-accuracy points, respectively. Notably, OccamToken remains close to the full-token baseline on MSRVTT and slightly improves the score on ActivityNet...

2012