Pith Number

pith:MHQFABVD

pith:2024:MHQFABVDPDJ25UCI6FPPO2QXTD
not attested · not anchored · not stored · refs resolved

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Chun-Kai Fan, Denis Gudovskiy, Junpeng Ma, Kuan Cheng, Kurt Keutzer, Shanghang Zhang, Tao Huang, Tomoyuki Okuno, Wenzhao Zheng, Yohei Nakata, Yuan Zhang

SparseVLM prunes visual tokens in VLMs using text attention scores without any training or added parameters.

arxiv:2410.04417 v4 · 2024-10-06 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MHQFABVDPDJ25UCI6FPPO2QXTD}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy.

C2 weakest assumption

That self-attention scores between selected text tokens and visual tokens reliably identify which visual tokens can be pruned or recycled without losing task-critical information.

C3 one line summary

SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
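The idea behind the claims above, scoring visual tokens by how much attention the text tokens pay to them and keeping only the highest-scoring ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_visual_tokens`, the fixed `keep_ratio`, and the plain averaging over text tokens are all assumptions made for illustration, and the recycling of pruned tokens mentioned in the summary is omitted.

```python
def prune_visual_tokens(attn, keep_ratio=0.5):
    """Return indices of visual tokens to keep (hypothetical helper).

    attn[i][j] is assumed to be the attention weight from text token i
    to visual token j. Each visual token is scored by its mean attention
    over all text tokens, and the top fraction `keep_ratio` is kept.
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not attn or not attn[0]:
        return []
    n_text = len(attn)
    n_visual = len(attn[0])
    # One relevance score per visual token: mean over text tokens.
    scores = [sum(row[j] for row in attn) / n_text for j in range(n_visual)]
    k = max(1, int(n_visual * keep_ratio))
    # Rank by descending score; ties go to the lower index.
    ranked = sorted(range(n_visual), key=lambda j: (-scores[j], j))
    # Return the survivors in their original order.
    return sorted(ranked[:k])


# Toy input: 2 text tokens attending to 4 visual tokens.
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.4, 0.4, 0.1],
]
kept = prune_visual_tokens(attn, keep_ratio=0.5)
print(kept)  # [1, 2]
```

The weakest assumption stated in C2 is visible here: any visual token with a low mean score is dropped regardless of whether it carries task-critical information that the attention weights fail to reflect.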

References

113 extracted · 113 resolved · 12 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

31 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:52.300692Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

61e05006a378d3aed048f15ef76a1798fae4e7f6a4698c1b62fcea3962ec2680

Aliases

arxiv: 2410.04417 · arxiv_version: 2410.04417v4 · doi: 10.48550/arxiv.2410.04417 · pith_short_12: MHQFABVDPDJ2 · pith_short_16: MHQFABVDPDJ25UCI · pith_short_8: MHQFABVD
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MHQFABVDPDJ25UCI6FPPO2QXTD \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 61e05006a378d3aed048f15ef76a1798fae4e7f6a4698c1b62fcea3962ec2680
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7f4d6f8fdd0d8c4f7bdd63f53aeab9330445dea86771847def61af753e0e9484",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-06T09:18:04Z",
    "title_canon_sha256": "5a5b188732cf551e1c8e59a062df3da8b29f8ba0ba1d9a822d2800cd56f8afd6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.04417",
    "kind": "arxiv",
    "version": 4
  }
}
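The hashing step of the verification one-liner above can also be written as a small reusable function. This is a sketch that mirrors the serialization settings used in that command (sorted keys, compact separators, no ASCII escaping, UTF-8 encoding); the name `canonical_hash` is hypothetical and not part of any Pith API.

```python
import hashlib
import json


def canonical_hash(record):
    """SHA-256 hex digest of a JSON-serializable record (hypothetical helper).

    Uses the same serialization settings as the verification one-liner:
    sorted keys, compact separators, non-ASCII characters left unescaped.
    """
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Key order in the input does not affect the digest, because keys are
# sorted before hashing.
a = canonical_hash({"schema_version": "1.0", "source": {"kind": "arxiv", "version": 4}})
b = canonical_hash({"source": {"version": 4, "kind": "arxiv"}, "schema_version": "1.0"})
print(a == b)  # True
```

Applied to the `canonical_record` object returned by the endpoint, the digest should equal the canonical hash listed above, provided the service serializes the record the same way the one-liner assumes.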