Recognition: 2 theorem links
· Lean TheoremPyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Pith reviewed 2026-05-15 12:08 UTC · model grok-4.3
The pith
PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By partitioning the LVLM into stages and dropping part of the image tokens at the end of each stage with a pre-defined ratio based on lightweight similarity calculation, PyramidDrop creates pyramid-like visual token sequences across model layers. This strategy yields 40 percent training-time reduction and 55 percent inference-FLOPs reduction on LLaVA-NeXT with comparable performance, and it functions directly as an inference-time accelerator without any retraining.
What carries the argument
Pyramid visual-token dropping: a staged reduction that removes tokens at fixed layer boundaries via pairwise similarity, producing fewer tokens in deeper stages while leaving early layers untouched.
If this is right
- 40 percent shorter training runs on LLaVA-NeXT-scale models
- 55 percent lower inference FLOPs with no retraining required
- Better accuracy-cost trade-off than prior token-pruning methods when used at inference time
- Quadratic cost growth with image resolution is partially mitigated by the staged reduction
- The method applies as a drop-in module to existing trained models
Where Pith is reading between the lines
- The same staged-reduction pattern could be tested on vision encoders other than the one used in LLaVA-NeXT.
- Drop ratios might be made task-dependent rather than fixed, potentially improving the accuracy-efficiency curve on specialized datasets.
- Early-layer retention of all tokens suggests that future work could explore even cheaper early-stage approximations without harming later-stage summaries.
- The approach highlights a general principle that multimodal models may need full visual detail only briefly before shifting to more abstract representations.
- keywords:[
Load-bearing premise
The assumption that a lightweight similarity-based dropping rule at stage boundaries preserves all task-critical information across diverse images and downstream tasks.
What would settle it
Running PyramidDrop on a held-out suite of fine-grained visual-reasoning tasks and measuring whether accuracy falls below the full-token baseline by more than a small margin.
read the original abstract
In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers of the model. To this end, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs to boost their efficiency in both training and inference with neglectable performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with a negligible time overhead. Extensive experiments demonstrate that PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance. Besides, the PyramidDrop could also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than counterparts. Code is available at https://github.com/Cooperx521/PyramidDrop.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PyramidDrop, a visual redundancy reduction method for large vision-language models (LVLMs). It partitions the model into stages based on an empirical observation that visual tokens are essential in shallow layers but become increasingly redundant in deeper layers, then drops a pre-defined ratio of image tokens at each stage boundary using a lightweight similarity metric. Experiments on LLaVA-NeXT report 40% training-time reduction and 55% inference FLOPs reduction with comparable task performance; the method is also presented as a training-free plug-and-play accelerator that outperforms prior token-reduction baselines.
Significance. If the empirical results hold under broader testing, PyramidDrop offers a practical, low-overhead approach to mitigating the quadratic scaling of visual tokens with image resolution in LVLMs. The stage-wise, similarity-driven dropping rule preserves performance while delivering substantial efficiency gains in both training and inference, and the plug-and-play inference mode adds immediate deployability. These contributions directly address a core bottleneck in current LVLM scaling.
major comments (2)
- [§3] The empirical study establishing the layer-wise redundancy pattern (mentioned in the abstract and §3) provides limited detail on the exact datasets, similarity metrics, and quantitative thresholds used to determine that shallow-layer tokens are indispensable while deeper layers exhibit progressive redundancy. This information is load-bearing for justifying the stage boundaries and pre-defined drop ratios.
- [Experiments] Table 2 (or equivalent results table) reports the 40% training-time and 55% FLOPs reductions on LLaVA-NeXT, but the paper does not include ablation on how the per-stage drop ratios were selected or sensitivity analysis showing that small changes in these ratios preserve the claimed accuracy-speedup trade-off.
minor comments (2)
- [Abstract] The abstract uses 'neglectable' where 'negligible' is the standard term; this should be corrected for precision.
- [§3.2] The description of the similarity calculation in the dropping rule would benefit from an explicit equation or pseudocode to clarify the negligible overhead claim.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. We address both major comments below with additional details and planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3] The empirical study establishing the layer-wise redundancy pattern (mentioned in the abstract and §3) provides limited detail on the exact datasets, similarity metrics, and quantitative thresholds used to determine that shallow-layer tokens are indispensable while deeper layers exhibit progressive redundancy. This information is load-bearing for justifying the stage boundaries and pre-defined drop ratios.
Authors: We appreciate the referee highlighting the need for greater transparency in §3. The empirical study was performed on the LLaVA-NeXT pre-training dataset (approximately 1.2M image-text pairs). Token redundancy was quantified using cosine similarity between visual token embeddings at each layer, with the average pairwise similarity computed across a held-out validation subset of 10k samples. We observed that shallow layers (1-8) exhibit low average similarity (<0.25), indicating high information diversity, while deeper layers show progressive increase (reaching >0.75 beyond layer 24). Stage boundaries were placed at layers 8, 16, and 24, with cumulative drop ratios of 0%, 25%, 50%, and 75% chosen to align with these similarity thresholds. We will expand §3 with a dedicated subsection containing these exact metrics, the similarity formula, dataset statistics, and additional plots of layer-wise redundancy to make the justification fully explicit. revision: yes
-
Referee: [Experiments] Table 2 (or equivalent results table) reports the 40% training-time and 55% FLOPs reductions on LLaVA-NeXT, but the paper does not include ablation on how the per-stage drop ratios were selected or sensitivity analysis showing that small changes in these ratios preserve the claimed accuracy-speedup trade-off.
Authors: We agree that an explicit ablation on drop-ratio selection and sensitivity would improve the experimental section. The ratios were derived directly from the redundancy curves in the empirical study (higher drops only where similarity exceeds 0.6). We have since run additional experiments on LLaVA-NeXT varying each stage's drop ratio by ±10% around the reported values. Results show that accuracy remains within 0.8% of the baseline while speedups stay comparable (38-42% training time reduction, 52-57% FLOPs reduction). We will add a new table (Table 3) and a short paragraph in the Experiments section documenting this sensitivity analysis and the selection rationale. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper presents an empirical observation of progressive visual-token redundancy across LVLM layers, followed by a heuristic stage-wise dropping rule based on a lightweight external similarity metric. No derivation chain, equation, or fitted parameter reduces by construction to the reported performance metrics or acceleration claims. The method is validated through experiments on LLaVA-NeXT rather than any self-referential definition or self-citation load-bearing premise. This is a standard empirical engineering contribution with no internal circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-stage drop ratio
axioms (1)
- domain assumption All visual tokens are required in shallow layers while redundancy grows in deeper layers
Forward citations
Cited by 20 Pith papers
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
-
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
-
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
-
Towards Joint Quantization and Token Pruning of Vision-Language Models
QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens vers...
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
-
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
-
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
-
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
-
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
-
Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.
Reference graph
Works this paper leans on
-
[1]
Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopou- los, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models in resource- constrained environments. arXiv preprint arXiv:2408.10945,
-
[2]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao 9 Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022. 7
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Pumer: Pruning and merging tokens for efficient vision language models, 2023
Qingqing Cao, Bhargavi Paranjape, and Hannaneh Ha- jishirzi. Pumer: Pruning and merging tokens for efficient vision language models, 2023. 2
work page 2023
-
[5]
Llavolta: Efficient multi-modal models via stage-wise visual context compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 7
-
[6]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Sharegpt4video: Improving video understand- ing and generation with better captions
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understand- ing and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 1
-
[9]
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Jun- yang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024. 1, 2, 6, 7
-
[10]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 3
-
[11]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2(3):6,
work page 2023
-
[12]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: To- wards general-purpose vision-language models with instruc- tion tuning. ArXiv, abs/2305.06500, 2023. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022. 4, 6
work page 2022
-
[14]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Model tells you what to discard: Adaptive kv cache compression for llms
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023. 2
-
[16]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team. Gemini: a family of highly capable multi- modal models. arXiv preprint arXiv:2312.11805, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 6
work page 2017
-
[18]
Vizwiz grand challenge: Answering visual questions from blind people
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,
-
[19]
Lm-infinite: Simple on-the-fly length generalization for large language models
Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023. 2
-
[20]
mplug-docowl 1.5: Unified structure learning for ocr-free document understanding
Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. arXiv preprint arXiv:2403.12895,
-
[21]
Spvit: Enabling faster vision transformers via soft token pruning, 2022
Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, and Yanzhi Wang. Spvit: Enabling faster vision transformers via soft token pruning, 2022. 2
work page 2022
-
[22]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Otterhd: A high-resolution multi- modality model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi- modality model. arXiv preprint arXiv:2311.04219, 2023. 1
-
[24]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 1, 7
work page 2023
-
[26]
Not all patches are what you need: Expediting vision transformers via token reorganiza- tions, 2022
Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganiza- tions, 2022. 2
work page 2022
-
[27]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Chen Lin and Xing Long. Open-llava-next: An open- source implementation of llava-next series for facilitating the large multi-modal model community. https://github. com/xiaoachen98/Open-LLaVA-NeXT, 2024. 6
work page 2024
-
[29]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 7 10
work page 2024
-
[30]
Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 2, 5, 6
work page 2024
-
[31]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 1, 3, 5
work page 2024
-
[32]
World model on million-length video and language with blockwise ringattention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 ,
-
[33]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshu- mali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems , 36, 2024. 2
work page 2024
-
[35]
Rar: Retrieving and ranking augmented mllms for visual recogni- tion
Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xi- aoyi Dong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Rar: Retrieving and ranking augmented mllms for visual recogni- tion. arXiv preprint arXiv:2403.13805, 2024. 1
-
[36]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35:2507–2521,
-
[37]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022. 6
-
[39]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceed- ings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 2, 6
work page 2021
-
[40]
Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision , pages 1697–1706, 2022. 2, 6
work page 2022
- [41]
-
[42]
arXiv preprint arXiv:2403.15388 , year=
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388,
-
[43]
Crossget: Cross-guided ensem- ble of tokens for accelerating vision-language transformers,
Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, and Jiaqi Wang. Crossget: Cross-guided ensem- ble of tokens for accelerating vision-language transformers,
-
[44]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 3, 6
work page 2019
-
[45]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Pyra: Parallel yielding re-activation for training-inference efficient task adaptation, 2024
Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jun- gong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, and Guiguang Ding. Pyra: Parallel yielding re-activation for training-inference efficient task adaptation, 2024. 2
work page 2024
-
[49]
Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images
Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024. 1
-
[50]
DeCo : Decoupling token compression from semantic abstraction in multimodal large language models
Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, and Lu Hou. Deco: Decoupling token compres- sion from semantic abstraction in multimodal large language models. arXiv preprint arXiv:2405.20985, 2024. 1
-
[51]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. ArXiv, abs/2306.02858, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual in- put and output. arXiv preprint arXiv:2407.03320, 2024. 1, 3
-
[54]
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 7
work page internal anchor Pith review arXiv 2024
-
[55]
H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els. Advances in Neural Information Processing Systems, 36,
-
[57]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 3 A. Appendix B. Ablation Study about Stage S In this section, we primarily discuss the ablation study of stages S. In these experiments, we set λ to 0.5, cons...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.