OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning
Pith reviewed 2026-06-29 08:44 UTC · model grok-4.3
The pith
OccamToken replaces absolute visual token ranking with register-anchored relative testing, pruning visual tokens to about 1.4% retention on LLaVA-NeXT while preserving over 93% of the original accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that register tokens naturally absorb low-information attention patterns and therefore supply a stable reference for identifying genuinely informative visual evidence. Replacing absolute importance ranking with register-anchored relative evidence testing yields both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention, delivering consistent gains in the accuracy-efficiency trade-off without any additional training.
What carries the argument
Register-anchored relative evidence testing, which compares each visual token's attention contribution against a register-based reference to decide retention via dynamic thresholds.
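The contrast between absolute ranking and a register-anchored test can be sketched as follows. This is a minimal illustration, not the paper's method: the source does not give the threshold formula, so the rule used here (mean register score scaled by a hypothetical factor `alpha`), the function names, and the toy scores are all assumptions.

```python
from statistics import mean

def top_k_prune(visual_scores, k):
    """Absolute ranking: always keep the k highest-scoring tokens."""
    ranked = sorted(range(len(visual_scores)),
                    key=lambda i: visual_scores[i], reverse=True)
    return sorted(ranked[:k])

def register_anchored_prune(visual_scores, register_scores, alpha=1.0):
    """Relative test: keep a token only if its score exceeds a
    reference derived from the register tokens.

    ASSUMPTION: threshold = alpha * mean(register_scores). The actual
    OccamToken threshold is not specified in the source text.
    """
    threshold = alpha * mean(register_scores)
    return [i for i, s in enumerate(visual_scores) if s > threshold]

# Toy attention scores (made up for illustration).
redundant_image = [0.02, 0.03, 0.02, 0.30, 0.02, 0.03]
detailed_image = [0.20, 0.25, 0.02, 0.30, 0.22, 0.03]
registers = [0.10, 0.12]

# Top-K keeps three tokens for both images, informative or not.
fixed_a = top_k_prune(redundant_image, 3)
fixed_b = top_k_prune(detailed_image, 3)

# The relative test keeps a different number of tokens per image.
adaptive_a = register_anchored_prune(redundant_image, registers)
adaptive_b = register_anchored_prune(detailed_image, registers)
```

In this toy setting the relative test retains one token for the redundant image and four for the detailed one, while top-K is forced to keep three in both cases.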
If this is right
- Accuracy-efficiency trade-off improves across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL without training.
- Stable compression remains possible even in the 1.4% retention regime.
- Both image-adaptive redundancy pruning and query-adaptive relevance pruning are achieved through the same register-derived thresholds.
- The method eliminates the need for a single fixed token budget applied to every input.
Where Pith is reading between the lines
- The same register-reference idea could be tested in other attention-heavy multimodal architectures where certain tokens act as sinks.
- The KV-cache memory attributable to visual tokens would drop roughly in proportion to the token reduction; model weights and text tokens are unaffected, and the saving only matters if the accuracy claim holds.
- Future work could examine whether the register tokens themselves can be further compressed once they have served as the reference.
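A rough way to size the memory implication above is to estimate the key-value cache for the visual tokens alone. The sketch below uses the token counts from the abstract (2,880 and about 40); the decoder dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical 7B-class values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    Two tensors (K and V) per layer, one head_dim vector per token
    per KV head.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * num_tokens)

# ASSUMPTION: hypothetical decoder dimensions, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

full = kv_cache_bytes(2880, LAYERS, KV_HEADS, HEAD_DIM)
pruned = kv_cache_bytes(40, LAYERS, KV_HEADS, HEAD_DIM)

print(full // 2**20, "MiB")    # 1440 MiB
print(pruned // 2**20, "MiB")  # 20 MiB
```

This is an upper bound on the saving: if pruning is applied at an intermediate decoder layer, the earlier layers still process and cache the full visual sequence.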
Load-bearing premise
Register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence.
What would settle it
On LLaVA-NeXT, accuracy at a 40-token budget with register-anchored relative testing falls below the accuracy obtained by absolute top-K ranking at the same budget.
Original abstract
Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
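The headline figures in the abstract are mutually consistent, as a quick check shows. Only the two token counts come from the source; the variable names are illustrative.

```python
original_tokens = 2880   # visual tokens on LLaVA-NeXT, per the abstract
retained_tokens = 40     # approximate count after pruning, per the abstract

retention = retained_tokens / original_tokens
print(f"{retention:.1%}")                          # 1.4%
print(f"{original_tokens / retained_tokens:.0f}x")  # 72x
```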
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes OccamToken, a training-free token pruning framework for VLMs that replaces absolute importance ranking with register-anchored relative evidence testing. It claims that register tokens serve as a stable low-information reference, enabling image-adaptive redundancy pruning and query-adaptive relevance pruning via dynamic thresholds. Experiments cover LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL; on LLaVA-NeXT the method reportedly reduces 2,880 visual tokens to ~40 (1.4% retention) while preserving over 93% of the original accuracy.
Significance. If the central claims hold after proper isolation of the register effect, the method would provide a practical, training-free route to extreme visual-token compression that adapts to both image redundancy and query dependence, addressing known brittleness of fixed top-K pruning in the presence of attention sinks.
major comments (2)
- [Abstract] Abstract (key insight paragraph): the assertion that register tokens 'naturally absorb low-information attention patterns' and thereby supply a stable reference is load-bearing for the entire framework, yet the manuscript supplies no ablation that holds all other components fixed while replacing the register reference with a non-register baseline (e.g., mean visual attention or a learned sink token). Without this isolation, observed gains at 1.4% retention cannot be attributed to the register property rather than dynamic thresholding alone.
- [Method] Method section (register-anchored relative testing): the derivation of query-adaptive thresholds from register attention is described only at a high level; the paper must supply the precise formula (including any scaling or normalization constants) and demonstrate that the resulting thresholds remain stable across the reported retention regimes.
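One way to set up the isolation ablation requested in the first major comment is to keep the thresholding rule fixed and swap only the reference. The sketch below is a hypothetical harness: the function names, the `alpha` scaling, and the toy scores are assumptions, and the thresholding rule stands in for whatever formula the paper actually uses.

```python
from statistics import mean

def prune_with_reference(visual_scores, reference_value, alpha=1.0):
    """Shared thresholding rule, held fixed across ablation arms."""
    threshold = alpha * reference_value
    return [i for i, s in enumerate(visual_scores) if s > threshold]

def register_reference(visual_scores, register_scores):
    # Arm 1: reference taken from the register tokens.
    return mean(register_scores)

def mean_visual_reference(visual_scores, register_scores):
    # Arm 2: non-register baseline, the mean visual-token score.
    return mean(visual_scores)

REFERENCES = {
    "register": register_reference,
    "mean_visual": mean_visual_reference,
}

def run_ablation(visual_scores, register_scores, alpha=1.0):
    """Return the retained token indices for each reference choice."""
    results = {}
    for name, ref_fn in REFERENCES.items():
        ref = ref_fn(visual_scores, register_scores)
        results[name] = prune_with_reference(visual_scores, ref, alpha)
    return results

# Toy scores (made up for illustration).
visual = [0.15, 0.25, 0.02, 0.30, 0.22, 0.03]
registers = [0.10, 0.12]
kept = run_ablation(visual, registers)
```

Because both arms share `prune_with_reference`, any difference in downstream accuracy between them could be attributed to the choice of reference rather than to dynamic thresholding itself.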
minor comments (2)
- [Abstract] Abstract: quantitative claims (2,880 tokens → ~40, over 93% of the original accuracy) are presented without reference to the exact evaluation protocol, datasets, or number of runs; these details belong in the abstract or a dedicated experimental-setup paragraph.
- Throughout: notation for register tokens versus visual tokens should be introduced once with consistent symbols rather than relying on prose descriptions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our core claims. We address each major point below and will revise the manuscript accordingly to strengthen the evidence and exposition.
Point-by-point responses
- Referee: [Abstract] Abstract (key insight paragraph): the assertion that register tokens 'naturally absorb low-information attention patterns' and thereby supply a stable reference is load-bearing for the entire framework, yet the manuscript supplies no ablation that holds all other components fixed while replacing the register reference with a non-register baseline (e.g., mean visual attention or a learned sink token). Without this isolation, observed gains at 1.4% retention cannot be attributed to the register property rather than dynamic thresholding alone.
  Authors: We agree that an explicit ablation isolating the register reference is necessary to attribute performance gains specifically to the register property rather than dynamic thresholding in isolation. In the revised manuscript we will add this ablation (holding all other components fixed) comparing register-anchored relative testing against non-register baselines such as mean visual attention and a fixed sink token, across the same retention regimes and models.
  Revision: yes
- Referee: [Method] Method section (register-anchored relative testing): the derivation of query-adaptive thresholds from register attention is described only at a high level; the paper must supply the precise formula (including any scaling or normalization constants) and demonstrate that the resulting thresholds remain stable across the reported retention regimes.
  Authors: We will expand the Method section to include the exact mathematical formulation of the query-adaptive thresholds (including all scaling and normalization constants) derived from register attention. We will also add a dedicated analysis subsection demonstrating threshold stability across the full range of reported retention ratios (50% down to 1.4%) on the evaluated models.
  Revision: yes
Circularity Check
No circularity: derivation is self-contained
Full rationale
The provided abstract and description present OccamToken as a training-free method that replaces absolute ranking with register-anchored relative testing, justified by a stated key insight about register tokens. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are shown that reduce any central claim to its own inputs by construction. The performance numbers (e.g., 2,880 to ~40 tokens while preserving over 93% of the original accuracy) are presented as empirical results rather than definitional equivalences. This matches the default expectation of no significant circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence.