Pith Number

pith:7UUOAUYU

pith:2024:7UUOAUYUO6BUPKPARCDZXJZJQH
not attested · not anchored · not stored · refs resolved

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Conghui He, Dahua Lin, Feng Wu, Jiajie Lu, Jiaqi Wang, Long Xing, Pan Zhang, Qidong Huang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang

PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.

arxiv:2410.17247 v2 · 2024-10-22 · cs.CV · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{7UUOAUYUO6BUPKPARCDZXJZJQH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
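
The bundle format and the actual merge rules are not given on this page, so the following is only a minimal sketch of what a deterministic merge can look like. The event shape `(timestamp, event_id, field, value)` and the last-write-wins rule are assumptions made for illustration, not the Pith algorithm. The point it shows is that sorting de-duplicated events by a total order before folding them yields the same state regardless of the order in which a mirror received them.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (timestamp, event_id, field, value).
Event = Tuple[str, str, str, str]


def merge(events: Iterable[Event]) -> Dict[str, str]:
    """Fold events into a state dict in a fixed total order.

    Duplicates are removed with set(); sorting the tuples gives a total
    order (timestamp first, then event_id), so the result does not depend
    on arrival order. Later events overwrite earlier ones per field.
    """
    state: Dict[str, str] = {}
    for _ts, _event_id, field, value in sorted(set(events)):
        state[field] = value
    return state


# Illustrative events only; none of these values come from the real bundle.
events = [
    ("2026-01-01T00:00:00Z", "e1", "citations", "open"),
    ("2026-02-01T00:00:00Z", "e2", "citations", "closed"),
    ("2026-01-15T00:00:00Z", "e3", "replications", "open"),
]

state_a = merge(events)
state_b = merge(list(reversed(events)) + events)  # shuffled, with duplicates
print(state_a == state_b)
```

Two mirrors that hold the same set of events in different orders compute identical states under this rule, which is the property the bundle description relies on.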

Claims

C1 · strongest claim

PyramidDrop accelerates LLaVA-NeXT by 40% in training time and 55% in inference FLOPs with comparable performance. It can also serve as a plug-and-play, training-free strategy for inference acceleration, with better performance and lower inference cost than its counterparts.

C2 · weakest assumption

A lightweight similarity-based dropping rule at stage boundaries is assumed to preserve all task-critical information across diverse images and downstream tasks; this is supported only by the reported experiments on LLaVA-NeXT.

C3 · one-line summary

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
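
The staged dropping described in the claims can be illustrated with a small pure-Python sketch. It is not the authors' implementation: scoring image tokens by dot-product similarity to a single `query` vector, the function names, and the `keep_ratio` of 0.5 are all assumptions chosen for illustration. It shows how keeping a fixed fraction of image tokens at each stage boundary produces a pyramid-shaped token count across stages.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the (assumed) similarity score."""
    return sum(x * y for x, y in zip(a, b))


def drop_tokens(image_tokens: List[Vector], query: Vector,
                keep_ratio: float) -> List[Vector]:
    """Keep the most query-similar fraction of tokens, in original order."""
    n_keep = max(1, int(len(image_tokens) * keep_ratio))
    ranked = sorted(range(len(image_tokens)),
                    key=lambda i: dot(image_tokens[i], query),
                    reverse=True)
    kept = sorted(ranked[:n_keep])  # restore positional order
    return [image_tokens[i] for i in kept]


def pyramid_schedule(n_tokens: int, n_stages: int,
                     keep_ratio: float) -> List[int]:
    """Image-token count entering each stage when dropping at every boundary."""
    counts = [n_tokens]
    for _ in range(n_stages - 1):
        counts.append(max(1, int(counts[-1] * keep_ratio)))
    return counts


# Hypothetical setting: 576 image tokens, 4 stages, half kept per boundary.
print(pyramid_schedule(576, 4, 0.5))  # [576, 288, 144, 72]

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
print(drop_tokens(tokens, query=[1.0, 0.0], keep_ratio=0.5))
# [[1.0, 0.0], [2.0, 0.0]]
```

Because later stages process fewer image tokens, the per-layer cost shrinks with depth, which is the mechanism behind the reported training-time and FLOPs savings.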

References

56 extracted · 56 resolved · 24 Pith anchors

[1] and Vandierendonck, Hans and John, Deepu and Ji, Bo , month = aug, year =
[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond 2023 · arXiv:2308.12966
[3] Token Merging: Your ViT But Faster 2022 · arXiv:2210.09461
[4] Pumer: Pruning and merging tokens for efficient vision language models 2023
[5] Llavolta: Efficient multi-modal models via stage-wise visual context compression 2024

Formal links

2 machine-checked theorem links

Cited by

27 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:52.581127Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

fd28e05314778347a9e088879ba72981d88bee57e5735546a2b6bed4a380c89a

Aliases

arxiv: 2410.17247 · arxiv_version: 2410.17247v2 · doi: 10.48550/arxiv.2410.17247 · pith_short_12: 7UUOAUYUO6BU · pith_short_16: 7UUOAUYUO6BUPKPA · pith_short_8: 7UUOAUYU
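
For this record, the Pith Number coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and the `pith_short_*` aliases are prefixes of it. Whether this is the official derivation rule is an assumption; the sketch below only checks the relationship on the values shown on this page. The helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "fd28e05314778347a9e088879ba72981d88bee57e5735546a2b6bed4a380c89a"
PITH_NUMBER = "7UUOAUYUO6BUPKPARCDZXJZJQH"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` chars."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


candidate = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))
print(candidate == PITH_NUMBER)  # True for this record

# The short aliases are prefixes of the full Pith Number.
short_aliases = {n: PITH_NUMBER[:n] for n in (8, 12, 16)}
print(short_aliases[8], short_aliases[12], short_aliases[16])
# 7UUOAUYU 7UUOAUYUO6BU 7UUOAUYUO6BUPKPA
```
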
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: fd28e05314778347a9e088879ba72981d88bee57e5735546a2b6bed4a380c89a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4b89e9e203d463d2e8d7523502838057237d47c51d79b18c5a0f200ce59b85dd",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-22T17:59:53Z",
    "title_canon_sha256": "1d500d068a96591af0d35cd55147e28c00ef34b0380d4446340ae066d9a215e2"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.17247",
    "kind": "arxiv",
    "version": 2
  }
}
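
The verification one-liner can also be run offline against the record printed above. This is a minimal sketch that reuses the same canonicalization (sorted keys, compact separators, UTF-8) as the `python3 -c` step; the function name `canonical_sha256` is hypothetical, and whether the digest equals the published hash depends on the listing above being the complete, unmodified canonical record.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" listing.
record = {
    "metadata": {
        "abstract_canon_sha256": "4b89e9e203d463d2e8d7523502838057237d47c51d79b18c5a0f200ce59b85dd",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-10-22T17:59:53Z",
        "title_canon_sha256": "1d500d068a96591af0d35cd55147e28c00ef34b0380d4446340ae066d9a215e2",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.17247", "kind": "arxiv", "version": 2},
}

EXPECTED = "fd28e05314778347a9e088879ba72981d88bee57e5735546a2b6bed4a380c89a"


def canonical_sha256(obj) -> str:
    """SHA-256 of the JSON form with sorted keys and compact separators."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the byte string independent of how the JSON was originally pretty-printed, so any mirror can recompute the same digest from the same record.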