LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs
Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3
The pith
Visual tokens in LVLMs can be pruned by 87.5 to 88.9 percent by scoring each token with its projection residual against a stable low-rank subspace estimated by PCA, preserving 94.7 percent of image-understanding performance and 97.8 percent of average video-understanding accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, LRCP estimates the dominant low-rank subspace of visual tokens via PCA, then scores each token by its projection residual onto this subspace and retains tokens that are poorly explained by the low-rank background.
What carries the argument
Projection residuals onto the PCA-estimated dominant low-rank subspace of the full visual token set, used to identify and retain tokens outside the stable background.
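The scoring rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `rank` and `keep` parameters, the mean-centering step, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def residual_scores(tokens, rank):
    """Score each token by its distance from the top-`rank` PCA subspace."""
    # Centering before the SVD is the usual PCA convention (assumed here).
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]
    # Part of each token that the dominant subspace explains.
    explained = (centered @ basis.T) @ basis
    # Norm of what is left over: the projection residual.
    return np.linalg.norm(centered - explained, axis=1)


def select_tokens(tokens, rank, keep):
    """Indices (in original order) of the `keep` highest-residual tokens."""
    scores = residual_scores(tokens, rank)
    top = np.argsort(-scores)[:keep]
    return np.sort(top)


# Synthetic stand-in for visual tokens: a rank-4 background plus three outliers.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 32))
tokens += 0.01 * rng.normal(size=tokens.shape)
outliers = [3, 50, 120]
tokens[outliers] += 1.0 * rng.normal(size=(3, 32))

kept = select_tokens(tokens, rank=4, keep=10)
print(kept)
```

On this toy input the three perturbed tokens sit far from the rank-4 background, so they receive the largest residuals and are among those retained. Real visual tokens would come from the vision encoder or projector of an LVLM, and the rank would have to be chosen for that setting.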
If this is right
- An 88.9 percent reduction in visual tokens for images preserves 94.7 percent of original understanding performance.
- An 87.5 percent reduction in visual tokens for videos preserves 97.8 percent of average understanding accuracy.
- The method outperforms prior attention-based and representation-based pruning techniques on the same benchmarks.
- Because the approach requires no training, it can be applied directly to any existing LVLM at inference time.
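To make the reduction ratios concrete, the short helper below converts a reduction fraction into a kept-token count. The token totals (576 and 1024) are hypothetical inputs chosen for illustration and are not taken from the paper; the helper name is likewise an assumption.

```python
def kept_token_count(total_tokens, reduction):
    """Tokens remaining after pruning a given fraction, rounded to an integer."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    return round(total_tokens * (1.0 - reduction))


# Hypothetical totals, used only to illustrate the ratios quoted above.
image_kept = kept_token_count(576, 0.889)   # 64 tokens kept
video_kept = kept_token_count(1024, 0.875)  # 128 tokens kept
print(image_kept, video_kept)
```

Under these assumed totals, an 88.9 percent reduction leaves roughly one token in nine, and an 87.5 percent reduction leaves exactly one in eight.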
Where Pith is reading between the lines
- If similar low-rank stability appears in text or audio token sequences, the same residual scoring could extend compression to other modalities inside multimodal models.
- Real-time monitoring of residual statistics might allow the model to adjust the kept-token budget dynamically per input rather than using a fixed ratio.
- Pairing the residual-based selection with post-training quantization could produce compounded speed-ups without further accuracy loss.
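The dynamic-budget idea in the second bullet could look something like the sketch below. It is purely speculative and not part of LRCP as described: the rule (keep the fewest tokens whose squared residuals reach a target share of the total residual energy), the function name, and the `energy` and `min_keep` parameters are all assumptions.

```python
import numpy as np


def adaptive_keep_count(residual, energy=0.9, min_keep=1):
    """Smallest number of top-residual tokens covering `energy` of residual energy."""
    # Squared residuals, largest first.
    sq = np.sort(np.asarray(residual, dtype=float) ** 2)[::-1]
    total = sq.sum()
    if total == 0.0:
        # Every token lies in the subspace; fall back to the floor.
        return min_keep
    cumulative = np.cumsum(sq) / total
    # First position at which the cumulative share reaches the target.
    count = int(np.searchsorted(cumulative, energy) + 1)
    return max(min_keep, min(count, sq.size))


print(adaptive_keep_count([3.0, 0.0, 0.0, 0.0]))  # one token carries all the energy
print(adaptive_keep_count([1.0, 1.0, 1.0, 1.0]))  # energy is spread evenly
```

An input whose residual energy is concentrated in a few tokens would get a small budget, while one with evenly spread residuals would keep more. Whether such a rule preserves accuracy is an open question that the source does not address.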
Load-bearing premise
The dominant low-rank subspace estimated by PCA on the full visual token set remains reliable for scoring even after a large fraction of tokens is removed.
What would settle it
If PCA subspaces computed on random subsets of 20 percent of tokens differ substantially from the full-set subspace, or if performance retention falls below 90 percent at 80 percent token reduction on standard image and video benchmarks, the pruning rule would be falsified.
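The first test above can be sketched as a comparison between the PCA subspace of the full token set and that of a random 20 percent subset. The overlap measure used here (mean squared cosine of the principal angles) is one common choice and is an assumption, as are the function names, the `rank` value, and the synthetic data.

```python
import numpy as np


def pca_basis(tokens, rank):
    """Top-`rank` principal directions as orthonormal rows."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of principal angles; 1.0 means identical subspaces."""
    # Singular values of A B^T are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a @ basis_b.T, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))


def random_subset_overlap(tokens, rank, fraction, seed=0):
    """Overlap between the full-set subspace and a random subset's subspace."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    size = min(n, max(rank + 1, int(round(fraction * n))))
    idx = rng.choice(n, size=size, replace=False)
    return subspace_overlap(pca_basis(tokens, rank), pca_basis(tokens[idx], rank))


# Synthetic low-rank tokens with small noise, standing in for real features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(400, 4)) @ rng.normal(size=(4, 32))
tokens += 0.01 * rng.normal(size=tokens.shape)

overlap = random_subset_overlap(tokens, rank=4, fraction=0.2)
print(overlap)
```

An overlap near 1.0 on real visual tokens would support the stability premise; a value well below 1.0 would count against it. The threshold for "differ substantially" is not specified in the source and would need to be fixed before running such a test.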
Original abstract
Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LRCP, a training-free framework for pruning visual tokens in LVLMs. It observes that visual token representations have a pronounced low-rank structure whose dominant subspace remains stable under large random removal, estimates this subspace via PCA on the full token set, scores tokens by their projection residuals, and retains high-residual tokens. The abstract reports that this preserves 94.7% of original image-understanding performance at 88.9% token reduction and 97.8% of average video-understanding accuracy at 87.5% token reduction, positioning it as superior to attention-based and other representation-based methods by avoiding positional bias and capturing global structure.
Significance. If the low-rank stability observation and residual-based pruning prove robust, LRCP could offer a simple, training-free alternative for reducing inference costs in high-resolution image and long-video LVLMs while maintaining most performance. The emphasis on global compressibility rather than local attention scores is a conceptual strength, and the reported retention rates at high compression ratios suggest practical utility if the experimental claims hold under scrutiny.
major comments (2)
- [Abstract and §3] Abstract and the low-rank observation section: the claim that the dominant subspace remains stable even after large random removal is used to justify computing PCA on the full token set. However, LRCP deterministically retains high-residual tokens and prunes low-residual ones rather than removing tokens at random, so the retained subset may shift the effective subspace in ways not tested by the random-subsampling stability check. This mismatch is load-bearing for the method's motivation and requires a direct verification using the actual pruning rule.
- [Abstract and Experiments] Experimental results (abstract): the reported 94.7% and 97.8% retention figures are presented without accompanying details on exact baselines, statistical significance tests, dataset splits, or controls for positional bias. This limits independent verification of the superiority claim and should be expanded with concrete experimental protocols.
minor comments (2)
- [Method] Clarify notation for the residual score and PCA rank selection in the method description to improve reproducibility.
- [Experiments] Add explicit comparison tables with attention-based baselines and ablation on the number of PCA components.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the motivation and experimental reporting.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and the low-rank observation section: the claim that the dominant subspace remains stable even after large random removal is used to justify computing PCA on the full token set. However, LRCP deterministically retains high-residual tokens and prunes low-residual ones rather than removing tokens at random, so the retained subset may shift the effective subspace in ways not tested by the random-subsampling stability check. This mismatch is load-bearing for the method's motivation and requires a direct verification using the actual pruning rule.
  Authors: We appreciate this observation on the distinction between random subsampling and our deterministic pruning rule. While the random-removal experiments in Section 3 demonstrate general stability of the dominant subspace, we agree that a direct check under the actual LRCP pruning procedure provides stronger justification. In the revised manuscript we have added a new ablation in Section 3.2 that recomputes PCA on the tokens retained by LRCP and reports the cosine similarity of the top principal components to those from the full set. The similarity remains above 0.94 across pruning ratios up to 90%, supporting the use of full-set PCA as a reliable estimate of the low-rank background.
  Revision: yes
- Referee: [Abstract and Experiments] Experimental results (abstract): the reported 94.7% and 97.8% retention figures are presented without accompanying details on exact baselines, statistical significance tests, dataset splits, or controls for positional bias. This limits independent verification of the superiority claim and should be expanded with concrete experimental protocols.
  Authors: We agree that additional experimental details are necessary for reproducibility and verification. The revised manuscript now includes an expanded experimental protocol subsection that specifies: (i) the exact implementations and hyper-parameters of all compared baselines, (ii) the dataset splits and evaluation metrics used for each benchmark, (iii) mean and standard deviation over three random seeds to indicate statistical variability, and (iv) an explicit control experiment that isolates positional bias by comparing LRCP against a position-shuffled variant of the attention-based methods. These additions clarify the reported retention rates and the superiority claims.
  Revision: yes
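The ablation described in the first response (recomputing PCA on the retained tokens and comparing its top principal components with those of the full set) might be sketched as follows. The function names, the `rank` value, the synthetic data, and the choice of which tokens count as retained are assumptions for illustration; the rebuttal does not give implementation details.

```python
import numpy as np


def top_components(tokens, rank):
    """Top-`rank` principal directions as orthonormal rows."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def component_cosines(tokens, kept_idx, rank):
    """Per-component |cosine| between full-set and retained-set principal directions."""
    full = top_components(tokens, rank)
    kept = top_components(tokens[kept_idx], rank)
    # Principal directions are defined only up to sign, so compare |cos|.
    return np.abs(np.sum(full * kept, axis=1))


# Synthetic low-rank tokens with small noise, standing in for real features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 32))
tokens += 0.01 * rng.normal(size=tokens.shape)

# Sanity check: retaining every token must reproduce the full-set components.
identity_cos = component_cosines(tokens, np.arange(tokens.shape[0]), rank=4)

# Hypothetical retained set: first half of the tokens, for illustration only.
half_cos = component_cosines(tokens, np.arange(150), rank=4)
print(identity_cos, half_cos)
```

Component-wise cosines can drop sharply when two principal directions have similar variance and swap order, even if the spanned subspace is unchanged, so a subspace-level measure based on principal angles may be a more forgiving complement to this check. The retained set needs at least `rank` tokens for the comparison to be defined.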
Circularity Check
No circularity in LRCP derivation chain
full rationale
The paper motivates LRCP from an independent empirical observation that visual token representations show low-rank structure with a dominant subspace stable under random token removal (verified separately from the pruning rule). It then applies standard PCA to compute this subspace on the full set and scores tokens by projection residuals, retaining high-residual ones. This chain does not reduce any claimed result to a fitted parameter renamed as prediction, a self-referential definition, or a load-bearing self-citation; the stability test uses uniform random subsampling while the method uses deterministic residual selection, but the former is presented only as motivation rather than a constructed input that forces the output. The performance claims are evaluated on external benchmarks and do not collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed.