Pith Number

pith:MHQFABVD

pith:2024:MHQFABVDPDJ25UCI6FPPO2QXTD

not attested · not anchored · not stored · refs resolved

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Chun-Kai Fan, Denis Gudovskiy, Junpeng Ma, Kuan Cheng, Kurt Keutzer, Shanghang Zhang, Tao Huang, Tomoyuki Okuno, Wenzhao Zheng, Yohei Nakata, Yuan Zhang

SparseVLM prunes visual tokens in VLMs using text attention scores without any training or added parameters.

arxiv:2410.04417 v4 · 2024-10-06 · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{MHQFABVDPDJ25UCI6FPPO2QXTD}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

SparseVLM increases the efficiency of various VLMs on a number of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy.

C2 weakest assumption

That self-attention scores between selected text tokens and visual tokens reliably identify which visual tokens can be pruned or recycled without losing task-critical information.

C3 one line summary

SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
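
The idea of ranking visual tokens by text attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_visual_tokens`, the plain mean over the selected text tokens, the fixed `keep` budget, and the toy attention matrix are all assumptions made for illustration, and the recycling step mentioned in the claims is left out.

```python
# Hypothetical sketch: rank visual tokens by the attention they receive from
# selected text tokens, then keep the top-k. Not the authors' implementation.

def prune_visual_tokens(attn, text_rows, keep):
    """Return sorted indices of the `keep` visual tokens with the highest
    mean attention from the selected text tokens.

    attn      -- attn[t][v]: attention from text token t to visual token v
    text_rows -- indices of the text tokens used as raters (assumed given)
    keep      -- number of visual tokens to retain
    """
    n_visual = len(attn[0])
    # Mean attention each visual token receives from the selected text tokens.
    scores = [
        sum(attn[t][v] for t in text_rows) / len(text_rows)
        for v in range(n_visual)
    ]
    # Highest score first; return the survivors in their original order.
    ranked = sorted(range(n_visual), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:keep])


# Toy attention matrix: 3 text tokens x 4 visual tokens (made-up values).
attn = [
    [0.05, 0.40, 0.10, 0.45],  # text token 0
    [0.25, 0.25, 0.25, 0.25],  # text token 1 (uniform, not selected below)
    [0.10, 0.50, 0.05, 0.35],  # text token 2
]
kept = prune_visual_tokens(attn, text_rows=[0, 2], keep=2)
print(kept)  # [1, 3]
```

Returning the kept indices in their original order preserves the positional order of the surviving visual tokens, which is one reasonable design choice when the pruned sequence is passed on to later layers.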

References

113 extracted · 113 resolved · 12 Pith anchors

[1] Flamingo: a visual language model for few-shot learning 2022

[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966

[3] Token merging: Your vit but faster 2023

[4] D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al 2020

[5] Cai, M., Yang, J., Gao, J., and Lee, Y. J. Matryoshka multimodal models. In International Conference on Learning Representations, 2025

Formal links

2 machine-checked theorem links

Cited by

31 papers in Pith

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

ASAP: Attention Sink Anchored Pruning

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Receipt and verification

First computed	2026-05-17T23:38:52.300692Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

61e05006a378d3aed048f15ef76a1798fae4e7f6a4698c1b62fcea3962ec2680

Aliases

arxiv: 2410.04417 · arxiv_version: 2410.04417v4 · doi: 10.48550/arxiv.2410.04417 · pith_short_12: MHQFABVDPDJ2 · pith_short_16: MHQFABVDPDJ25UCI · pith_short_8: MHQFABVD
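
The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the first 26 characters of the base32 encoding of the canonical hash, and the `pith_short_*` aliases are prefixes of that number. The page does not state this rule, so the sketch below treats it as an observation about these particular values rather than a documented part of the scheme.

```python
import base64

# Values copied from this page.
canonical_hash = "61e05006a378d3aed048f15ef76a1798fae4e7f6a4698c1b62fcea3962ec2680"
full_id = "MHQFABVDPDJ25UCI6FPPO2QXTD"

# Assumption: the ID is a prefix of the RFC 4648 base32 encoding of the hash bytes.
b32 = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
derived = b32[:len(full_id)]
print(derived == full_id)  # True

# Assumption: each short alias is the first n characters of the full ID.
aliases = {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}
print(aliases["pith_short_8"], aliases["pith_short_12"], aliases["pith_short_16"])
# MHQFABVD MHQFABVDPDJ2 MHQFABVDPDJ25UCI
```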

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/MHQFABVDPDJ25UCI6FPPO2QXTD \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 61e05006a378d3aed048f15ef76a1798fae4e7f6a4698c1b62fcea3962ec2680

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "7f4d6f8fdd0d8c4f7bdd63f53aeab9330445dea86771847def61af753e0e9484",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-06T09:18:04Z",
    "title_canon_sha256": "5a5b188732cf551e1c8e59a062df3da8b29f8ba0ba1d9a822d2800cd56f8afd6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.04417",
    "kind": "arxiv",
    "version": 4
  }
}
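
The same check can be run offline against the record printed above. The sketch below mirrors the serialization used in the verification one-liner (sorted keys, compact separators, no ASCII escaping); the helper name `canonical_sha256` is hypothetical, and the record is transcribed by hand from this page, so any transcription difference would change the digest.

```python
import hashlib
import json


def canonical_sha256(record):
    """Hash a record the way the verification one-liner does:
    sorted keys, compact separators, UTF-8 bytes, SHA-256."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(payload).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "7f4d6f8fdd0d8c4f7bdd63f53aeab9330445dea86771847def61af753e0e9484",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-10-06T09:18:04Z",
        "title_canon_sha256": "5a5b188732cf551e1c8e59a062df3da8b29f8ba0ba1d9a822d2800cd56f8afd6",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.04417", "kind": "arxiv", "version": 4},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because the keys are sorted before hashing, the digest does not depend on the order in which the fields are written, which is what lets a mirror recompute it from any faithful copy of the record.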