pith. sign in

arxiv: 2605.29657 · v1 · pith:JZNRZBINnew · submitted 2026-05-28 · 💻 cs.CV · cs.AI

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

Pith reviewed 2026-06-29 08:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords token pruningvision-language modelsefficient inferenceattention mechanismsregister tokensadaptive pruningtraining-free methods
0
0 comments X

The pith

OccamToken replaces absolute visual token ranking with register-anchored relative testing to prune tokens down to 1.4% retention while keeping over 93% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models incur high prefill costs from long visual token sequences. Absolute ranking methods that pick a fixed top-K set prove brittle because attention sinks skew scores and because image redundancy plus query dependence make any fixed budget unreliable. OccamToken instead anchors pruning on register tokens that absorb low-information attention patterns and uses them as a reference to test whether each visual token adds genuine evidence. Dynamic thresholds derived from register attention then drive both image-adaptive redundancy removal and query-adaptive relevance removal. The approach runs without training and improves the accuracy-efficiency curve on LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL.

Core claim

The paper claims that register tokens naturally absorb low-information attention patterns and therefore supply a stable reference for identifying genuinely informative visual evidence. Replacing absolute importance ranking with register-anchored relative evidence testing yields both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention, delivering consistent gains in the accuracy-efficiency trade-off without any additional training.

What carries the argument

Register-anchored relative evidence testing, which compares each visual token's attention contribution against a register-based reference to decide retention via dynamic thresholds.

If this is right

  • Accuracy-efficiency trade-off improves across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL without training.
  • Stable compression remains possible even at the 1.4% retention regime.
  • Both image-adaptive redundancy pruning and query-adaptive relevance pruning are achieved through the same register-derived thresholds.
  • The method eliminates the need for per-input fixed token budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same register-reference idea could be tested in other attention-heavy multimodal architectures where certain tokens act as sinks.
  • Memory footprint during inference would drop roughly in proportion to the token reduction if the accuracy claim holds.
  • Future work could examine whether the register tokens themselves can be further compressed once they have served as the reference.

Load-bearing premise

Register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence.

What would settle it

On LLaVA-NeXT, accuracy at a 40-token budget with register-anchored relative testing falls below the accuracy obtained by absolute top-K ranking at the same budget.

read the original abstract

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes OccamToken, a training-free token pruning framework for VLMs that replaces absolute importance ranking with register-anchored relative evidence testing. It claims that register tokens serve as a stable low-information reference, enabling image-adaptive redundancy pruning and query-adaptive relevance pruning via dynamic thresholds; experiments on LLaVA-NeXT, LLaVA-v1.5 and Qwen3-VL reportedly reduce 2880 visual tokens to ~40 while retaining >93% accuracy even at 1.4% retention.

Significance. If the central claims hold after proper isolation of the register effect, the method would provide a practical, training-free route to extreme visual-token compression that adapts to both image redundancy and query dependence, addressing known brittleness of fixed top-K pruning in the presence of attention sinks.

major comments (2)
  1. [Abstract] Abstract (key insight paragraph): the assertion that register tokens 'naturally absorb low-information attention patterns' and thereby supply a stable reference is load-bearing for the entire framework, yet the manuscript supplies no ablation that holds all other components fixed while replacing the register reference with a non-register baseline (e.g., mean visual attention or a learned sink token). Without this isolation, observed gains at 1.4% retention cannot be attributed to the register property rather than dynamic thresholding alone.
  2. [Method] Method section (register-anchored relative testing): the derivation of query-adaptive thresholds from register attention is described only at a high level; the paper must supply the precise formula (including any scaling or normalization constants) and demonstrate that the resulting thresholds remain stable across the reported retention regimes.
minor comments (2)
  1. [Abstract] Abstract: quantitative claims (2,880 tokens → ~40, >93% accuracy) are presented without reference to the exact evaluation protocol, datasets, or number of runs; these details belong in the abstract or a dedicated experimental-setup paragraph.
  2. Throughout: notation for register tokens versus visual tokens should be introduced once with consistent symbols rather than relying on prose descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our core claims. We address each major point below and will revise the manuscript accordingly to strengthen the evidence and exposition.

read point-by-point responses
  1. Referee: [Abstract] Abstract (key insight paragraph): the assertion that register tokens 'naturally absorb low-information attention patterns' and thereby supply a stable reference is load-bearing for the entire framework, yet the manuscript supplies no ablation that holds all other components fixed while replacing the register reference with a non-register baseline (e.g., mean visual attention or a learned sink token). Without this isolation, observed gains at 1.4% retention cannot be attributed to the register property rather than dynamic thresholding alone.

    Authors: We agree that an explicit ablation isolating the register reference is necessary to attribute performance gains specifically to the register property rather than dynamic thresholding in isolation. In the revised manuscript we will add this ablation (holding all other components fixed) comparing register-anchored relative testing against non-register baselines such as mean visual attention and a fixed sink token, across the same retention regimes and models. revision: yes

  2. Referee: [Method] Method section (register-anchored relative testing): the derivation of query-adaptive thresholds from register attention is described only at a high level; the paper must supply the precise formula (including any scaling or normalization constants) and demonstrate that the resulting thresholds remain stable across the reported retention regimes.

    Authors: We will expand the Method section to include the exact mathematical formulation of the query-adaptive thresholds (including all scaling and normalization constants) derived from register attention. We will also add a dedicated analysis subsection demonstrating threshold stability across the full range of reported retention ratios (50% down to 1.4%) on the evaluated models. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is self-contained

full rationale

The provided abstract and description present OccamToken as a training-free method that replaces absolute ranking with register-anchored relative testing, justified by a stated key insight about register tokens. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are shown that reduce any central claim to its own inputs by construction. The performance numbers (e.g., 2,880 to ~40 tokens at 93% accuracy) are presented as empirical results rather than definitional equivalences. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the core assumption stated in the text; no free parameters or invented entities are explicitly introduced.

axioms (1)
  • domain assumption Register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence.
    This premise is presented as the key insight enabling the relative evidence testing and dynamic thresholds.

pith-pipeline@v0.9.1-grok · 5810 in / 1339 out tokens · 32952 ms · 2026-06-29T08:44:09.345972+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

  3. [3]

    Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

    Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token?arXiv preprint arXiv:2504.02732,

  4. [4]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023a. Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision lang...

  5. [5]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023b. Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual ...

  6. [6]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    12 Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191,

  7. [7]

    important tokens

    Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for “important tokens” in multimodal language models: Duplication matters more. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9972–9991,

  8. [8]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis

    Accessed: 2025-04-26. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations,

  9. [9]

    Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 22128–22136, 2025a. Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for la...

  10. [10]

    After this point, applying pruning at later layers brings only marginal accuracy gains, but reduces the efficiency benefit because more decoder layers have already processed the full visual sequence. Therefore, we 14 choose layer 11 as the default Stage II pruning layer for LLaVA-series models, which provides a practical balance between reliable query-awa...

  11. [11]

    unanswerable

    POPELi et al. (2023b) evaluates object hallucination by asking binary questions about whether a specific object exists in the image. It constructs questions under three sampling strategies: random, popular, and adversarial. This benchmark is particularly relevant for visual token pruning, since overly aggressive pruning may remove small or less salient ob...

  12. [12]

    to examine its applicability beyond image-only inputs. As shown in Table 8, OccamToken preserves 97.7% of the full-token Video-LLaVA performance, outperforming SparseVLM and VisionZip by 11.2 and 4.5 relative-accuracy points, respectively. Notably, OccamToken remains close to the full-token baseline on MSRVTT and slightly improves the score on ActivityNet...