{"paper":{"title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Conghui He, Dahua Lin, Feng Wu, Jiajie Lu, Jiaqi Wang, Long Xing, Pan Zhang, Qidong Huang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang","submitted_at":"2024-10-22T17:59:53Z","abstract_excerpt":"In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom \"A picture is worth a thousand words\" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance. Besides, the PyramidDrop could also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than counterparts.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that a lightweight similarity-based dropping rule at stage boundaries preserves all task-critical information across diverse images and downstream tasks, which is supported only by the reported experiments on LLaVA-NeXT.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f9561cced1c405381ecd76b21bf0e6e39f0ccbf19d775f23743c12b5f7452762"},"source":{"id":"2410.17247","kind":"arxiv","version":2},"verdict":{"id":"73609107-4961-4d59-930b-fc5757e89e9d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:08:02.789286Z","strongest_claim":"PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance. Besides, the PyramidDrop could also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than counterparts.","one_line_summary":"PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that a lightweight similarity-based dropping rule at stage boundaries preserves all task-critical information across diverse images and downstream tasks, which is supported only by the reported experiments on LLaVA-NeXT.","pith_extraction_headline":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance."},"references":{"count":56,"sample":[{"doi":"","year":null,"title":"and Vandierendonck, Hans and John, Deepu and Ji, Bo , month = aug, year =","work_id":"3a0aaf48-f5a2-416b-ab1b-b731e48b17fa","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":2,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2022,"title":"Token Merging: Your ViT But Faster","work_id":"528509bc-2611-4e7f-a772-ea14d25b6dae","ref_index":3,"cited_arxiv_id":"2210.09461","is_internal_anchor":true},{"doi":"","year":2023,"title":"Pumer: Pruning and merging tokens for efficient vision language models, 2023","work_id":"336b719e-ec94-4600-b335-5e6f05df1a9a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Llavolta: Efficient multi-modal models via stage-wise visual context compression","work_id":"af34adfc-9e60-452f-b15b-8d50640ed007","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":56,"snapshot_sha256":"922c6f3cfa7b3e1c7db69fe0423d607f4fa83ef9e52d496bd4136e4d8001d956","internal_anchors":24},"formal_canon":{"evidence_count":2,"snapshot_sha256":"578c342be0f7eed5e34ef3956095338c1015c982837b28fab5597a7de83a1147"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}