{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:7UUOAUYUO6BUPKPARCDZXJZJQH","short_pith_number":"pith:7UUOAUYU","schema_version":"1.0","canonical_sha256":"fd28e05314778347a9e088879ba72981d88bee57e5735546a2b6bed4a380c89a","source":{"kind":"arxiv","id":"2410.17247","version":2},"attestation_state":"computed","paper":{"title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Conghui He, Dahua Lin, Feng Wu, Jiajie Lu, Jiaqi Wang, Long Xing, Pan Zhang, Qidong Huang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang","submitted_at":"2024-10-22T17:59:53Z","abstract_excerpt":"In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom \"A picture is worth a thousand words\" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2410.17247","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-10-22T17:59:53Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"1d500d068a96591af0d35cd55147e28c00ef34b0380d4446340ae066d9a215e2","abstract_canon_sha256":"4b89e9e203d463d2e8d7523502838057237d47c51d79b18c5a0f200ce59b85dd"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.581582Z","signature_b64":"8kVVbUH/rnxtOowuoE6aRxbtCKpaBNvx0XUq8lzmawuMKyPee2jF0no4/2IUy71n5rrqTT7DIdJIpPZqEsT5Aw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"fd28e05314778347a9e088879ba72981d88bee57e5735546a2b6bed4a380c89a","last_reissued_at":"2026-05-17T23:38:52.581127Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.581127Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Conghui He, Dahua Lin, Feng Wu, Jiajie Lu, Jiaqi Wang, Long Xing, Pan Zhang, Qidong Huang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang","submitted_at":"2024-10-22T17:59:53Z","abstract_excerpt":"In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom \"A picture is worth a thousand words\" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance. Besides, the PyramidDrop could also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than counterparts.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that a lightweight similarity-based dropping rule at stage boundaries preserves all task-critical information across diverse images and downstream tasks, which is supported only by the reported experiments on LLaVA-NeXT.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f9561cced1c405381ecd76b21bf0e6e39f0ccbf19d775f23743c12b5f7452762"},"source":{"id":"2410.17247","kind":"arxiv","version":2},"verdict":{"id":"73609107-4961-4d59-930b-fc5757e89e9d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:08:02.789286Z","strongest_claim":"PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance. Besides, the PyramidDrop could also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than counterparts.","one_line_summary":"PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that a lightweight similarity-based dropping rule at stage boundaries preserves all task-critical information across diverse images and downstream tasks, which is supported only by the reported experiments on LLaVA-NeXT.","pith_extraction_headline":"PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance."},"references":{"count":56,"sample":[{"doi":"","year":null,"title":"and Vandierendonck, Hans and John, Deepu and Ji, Bo , month = aug, year =","work_id":"3a0aaf48-f5a2-416b-ab1b-b731e48b17fa","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":2,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2022,"title":"Token Merging: Your ViT But Faster","work_id":"528509bc-2611-4e7f-a772-ea14d25b6dae","ref_index":3,"cited_arxiv_id":"2210.09461","is_internal_anchor":true},{"doi":"","year":2023,"title":"Pumer: Pruning and merging tokens for efficient vision language models, 2023","work_id":"336b719e-ec94-4600-b335-5e6f05df1a9a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Llavolta: Efficient multi-modal models via stage-wise visual context compression","work_id":"af34adfc-9e60-452f-b15b-8d50640ed007","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":56,"snapshot_sha256":"922c6f3cfa7b3e1c7db69fe0423d607f4fa83ef9e52d496bd4136e4d8001d956","internal_anchors":24},"formal_canon":{"evidence_count":2,"snapshot_sha256":"578c342be0f7eed5e34ef3956095338c1015c982837b28fab5597a7de83a1147"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.17247","created_at":"2026-05-17T23:38:52.581212+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.17247v2","created_at":"2026-05-17T23:38:52.581212+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.17247","created_at":"2026-05-17T23:38:52.581212+00:00"},{"alias_kind":"pith_short_12","alias_value":"7UUOAUYUO6BU","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"7UUOAUYUO6BUPKPA","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"7UUOAUYU","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2503.14075","citing_title":"Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20950","citing_title":"Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15621","citing_title":"LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17447","citing_title":"FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19218","citing_title":"Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19506","citing_title":"EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2603.01400","citing_title":"Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22911","citing_title":"ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27960","citing_title":"Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12056","citing_title":"OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08985","citing_title":"LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09429","citing_title":"Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23950","citing_title":"LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05848","citing_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12358","citing_title":"Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11240","citing_title":"Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11627","citing_title":"POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs","ref_index":94,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11122","citing_title":"Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07812","citing_title":"HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08077","citing_title":"AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05848","citing_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06036","citing_title":"CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15188","citing_title":"VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17087","citing_title":"EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17320","citing_title":"Towards Joint Quantization and Token Pruning of Vision-Language Models","ref_index":42,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH","json":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH.json","graph_json":"https://pith.science/api/pith-number/7UUOAUYUO6BUPKPARCDZXJZJQH/graph.json","events_json":"https://pith.science/api/pith-number/7UUOAUYUO6BUPKPARCDZXJZJQH/events.json","paper":"https://pith.science/paper/7UUOAUYU"},"agent_actions":{"view_html":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH","download_json":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH.json","view_paper":"https://pith.science/paper/7UUOAUYU","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.17247&json=true","fetch_graph":"https://pith.science/api/pith-number/7UUOAUYUO6BUPKPARCDZXJZJQH/graph.json","fetch_events":"https://pith.science/api/pith-number/7UUOAUYUO6BUPKPARCDZXJZJQH/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH/action/timestamp_anchor","attest_storage":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH/action/storage_attestation","attest_author":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH/action/author_attestation","sign_citation":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH/action/citation_signature","submit_replication":"https://pith.science/pith/7UUOAUYUO6BUPKPARCDZXJZJQH/action/replication_record"}},"created_at":"2026-05-17T23:38:52.581212+00:00","updated_at":"2026-05-17T23:38:52.581212+00:00"}