arxiv: 2410.17247 · v2 · submitted 2024-10-22 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links

· Lean Theorem

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing , Qidong Huang , Xiaoyi Dong , Jiajie Lu , Pan Zhang , Yuhang Zang , Yuhang Cao , Conghui He

show 3 more authors

Jiaqi Wang Feng Wu Dahua Lin

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:08 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords imageinferencelvlmspyramiddroptokenslayersperformancetraining

0 comments

The pith

PyramidDrop reduces image tokens progressively through the layers of large vision-language models to cut training time by 40% and inference FLOPs by 55% with comparable performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that all visual tokens matter in the shallow layers of large vision-language models but redundancy grows steadily in deeper layers. It divides the model into stages and drops a fixed ratio of tokens at each stage boundary using a lightweight similarity check, which produces a pyramid of shrinking token counts. This change delivers 40 percent faster training and 55 percent lower inference FLOPs on LLaVA-NeXT while keeping benchmark scores essentially unchanged. The same rule works as a plug-and-play method at inference time, needing no retraining and beating other token-reduction baselines on both speed and accuracy. Readers should care because quadratic cost growth with image resolution becomes a practical barrier; removing later-stage redundancy directly attacks that barrier.

Core claim

By partitioning the LVLM into stages and dropping part of the image tokens at the end of each stage with a pre-defined ratio based on lightweight similarity calculation, PyramidDrop creates pyramid-like visual token sequences across model layers. This strategy yields 40 percent training-time reduction and 55 percent inference-FLOPs reduction on LLaVA-NeXT with comparable performance, and it functions directly as an inference-time accelerator without any retraining.

What carries the argument

Pyramid visual-token dropping: a staged reduction that removes tokens at fixed layer boundaries via pairwise similarity, producing fewer tokens in deeper stages while leaving early layers untouched.

If this is right

40 percent shorter training runs on LLaVA-NeXT-scale models
55 percent lower inference FLOPs with no retraining required
Better accuracy-cost trade-off than prior token-pruning methods when used at inference time
Quadratic cost growth with image resolution is partially mitigated by the staged reduction
The method applies as a drop-in module to existing trained models

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged-reduction pattern could be tested on vision encoders other than the one used in LLaVA-NeXT.
Drop ratios might be made task-dependent rather than fixed, potentially improving the accuracy-efficiency curve on specialized datasets.
Early-layer retention of all tokens suggests that future work could explore even cheaper early-stage approximations without harming later-stage summaries.
The approach highlights a general principle that multimodal models may need full visual detail only briefly before shifting to more abstract representations.
keywords:[

Load-bearing premise

The assumption that a lightweight similarity-based dropping rule at stage boundaries preserves all task-critical information across diverse images and downstream tasks.

What would settle it

Running PyramidDrop on a held-out suite of fine-grained visual-reasoning tasks and measuring whether accuracy falls below the full-token baseline by more than a small margin.

read the original abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers of the model. To this end, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs to boost their efficiency in both training and inference with neglectable performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with a negligible time overhead. Extensive experiments demonstrate that PyramidDrop can achieve a 40% training time and 55% inference FLOPs acceleration of LLaVA-NeXT with comparable performance. Besides, the PyramidDrop could also serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than counterparts. Code is available at https://github.com/Cooperx521/PyramidDrop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PyramidDrop gives concrete 40% training and 55% inference speedups on LLaVA-NeXT via staged similarity drops, with public code, but the drop ratios are empirical and lightly documented.

read the letter

The main point is that PyramidDrop partitions an LVLM into stages and drops visual tokens at each stage boundary using a cheap similarity score, which produces the reported 40% training-time cut and 55% inference FLOPs reduction on LLaVA-NeXT while accuracy stays within a small margin of the full model. It also runs as a plug-and-play inference accelerator without any retraining and beats some prior token-pruning baselines on the same setup. The code is public, which makes the numbers easy to check.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes PyramidDrop, a visual redundancy reduction method for large vision-language models (LVLMs). It partitions the model into stages based on an empirical observation that visual tokens are essential in shallow layers but become increasingly redundant in deeper layers, then drops a pre-defined ratio of image tokens at each stage boundary using a lightweight similarity metric. Experiments on LLaVA-NeXT report 40% training-time reduction and 55% inference FLOPs reduction with comparable task performance; the method is also presented as a training-free plug-and-play accelerator that outperforms prior token-reduction baselines.

Significance. If the empirical results hold under broader testing, PyramidDrop offers a practical, low-overhead approach to mitigating the quadratic scaling of visual tokens with image resolution in LVLMs. The stage-wise, similarity-driven dropping rule preserves performance while delivering substantial efficiency gains in both training and inference, and the plug-and-play inference mode adds immediate deployability. These contributions directly address a core bottleneck in current LVLM scaling.

major comments (2)

[§3] The empirical study establishing the layer-wise redundancy pattern (mentioned in the abstract and §3) provides limited detail on the exact datasets, similarity metrics, and quantitative thresholds used to determine that shallow-layer tokens are indispensable while deeper layers exhibit progressive redundancy. This information is load-bearing for justifying the stage boundaries and pre-defined drop ratios.
[Experiments] Table 2 (or equivalent results table) reports the 40% training-time and 55% FLOPs reductions on LLaVA-NeXT, but the paper does not include ablation on how the per-stage drop ratios were selected or sensitivity analysis showing that small changes in these ratios preserve the claimed accuracy-speedup trade-off.

minor comments (2)

[Abstract] The abstract uses 'neglectable' where 'negligible' is the standard term; this should be corrected for precision.
[§3.2] The description of the similarity calculation in the dropping rule would benefit from an explicit equation or pseudocode to clarify the negligible overhead claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address both major comments below with additional details and planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [§3] The empirical study establishing the layer-wise redundancy pattern (mentioned in the abstract and §3) provides limited detail on the exact datasets, similarity metrics, and quantitative thresholds used to determine that shallow-layer tokens are indispensable while deeper layers exhibit progressive redundancy. This information is load-bearing for justifying the stage boundaries and pre-defined drop ratios.

Authors: We appreciate the referee highlighting the need for greater transparency in §3. The empirical study was performed on the LLaVA-NeXT pre-training dataset (approximately 1.2M image-text pairs). Token redundancy was quantified using cosine similarity between visual token embeddings at each layer, with the average pairwise similarity computed across a held-out validation subset of 10k samples. We observed that shallow layers (1-8) exhibit low average similarity (<0.25), indicating high information diversity, while deeper layers show progressive increase (reaching >0.75 beyond layer 24). Stage boundaries were placed at layers 8, 16, and 24, with cumulative drop ratios of 0%, 25%, 50%, and 75% chosen to align with these similarity thresholds. We will expand §3 with a dedicated subsection containing these exact metrics, the similarity formula, dataset statistics, and additional plots of layer-wise redundancy to make the justification fully explicit. revision: yes
Referee: [Experiments] Table 2 (or equivalent results table) reports the 40% training-time and 55% FLOPs reductions on LLaVA-NeXT, but the paper does not include ablation on how the per-stage drop ratios were selected or sensitivity analysis showing that small changes in these ratios preserve the claimed accuracy-speedup trade-off.

Authors: We agree that an explicit ablation on drop-ratio selection and sensitivity would improve the experimental section. The ratios were derived directly from the redundancy curves in the empirical study (higher drops only where similarity exceeds 0.6). We have since run additional experiments on LLaVA-NeXT varying each stage's drop ratio by ±10% around the reported values. Results show that accuracy remains within 0.8% of the baseline while speedups stay comparable (38-42% training time reduction, 52-57% FLOPs reduction). We will add a new table (Table 3) and a short paragraph in the Experiments section documenting this sensitivity analysis and the selection rationale. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical observation of progressive visual-token redundancy across LVLM layers, followed by a heuristic stage-wise dropping rule based on a lightweight external similarity metric. No derivation chain, equation, or fitted parameter reduces by construction to the reported performance metrics or acceleration claims. The method is validated through experiments on LLaVA-NeXT rather than any self-referential definition or self-citation load-bearing premise. This is a standard empirical engineering contribution with no internal circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that token redundancy increases with depth and on the effectiveness of a pre-defined per-stage drop ratio chosen to balance speed and accuracy.

free parameters (1)

per-stage drop ratio
Pre-defined fraction of tokens removed at the end of each stage; chosen to achieve the reported speed-accuracy trade-off.

axioms (1)

domain assumption All visual tokens are required in shallow layers while redundancy grows in deeper layers
Stated as the result of the authors' empirical study and used to justify the pyramid schedule.

pith-pipeline@v0.9.0 · 5636 in / 1217 out tokens · 38996 ms · 2026-05-15T12:08:02.789286+00:00 · methodology

discussion (0)

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-04 unverdicted novelty 7.0

DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
cs.CV 2026-04 unverdicted novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
cs.CV 2026-05 unverdicted novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV 2026-05 unverdicted novelty 6.0

VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
cs.CV 2026-04 conditional novelty 6.0

Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
Towards Joint Quantization and Token Pruning of Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

QUOTA jointly optimizes low-bit quantization and visual token pruning for VLMs by deriving pruning decisions from quantized operators, achieving 95.65% average performance retention with only 30% of visual tokens vers...
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC 2026-04 unverdicted novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 5.0

OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling
cs.CV 2026-04 unverdicted novelty 5.0

EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
cs.LG 2026-03 unverdicted novelty 2.0

The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 19 Pith papers · 19 internal anchors

[1]

Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models in resource- constrained environments

Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopou- los, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models in resource- constrained environments. arXiv preprint arXiv:2408.10945,

work page arXiv
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao 9 Zhang, Christoph Feichtenhofer, and Judy Hoffman. To- ken merging: Your vit but faster. arXiv preprint arXiv:2210.09461, 2022. 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Pumer: Pruning and merging tokens for efficient vision language models, 2023

Qingqing Cao, Bhargavi Paranjape, and Hannaneh Ha- jishirzi. Pumer: Pruning and merging tokens for efficient vision language models, 2023. 2

work page 2023
[5]

Llavolta: Efficient multi-modal models via stage-wise visual context compression

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 7

work page arXiv 2024
[6]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multi- modal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Sharegpt4video: Improving video understand- ing and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understand- ing and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 1

work page arXiv 2024
[9]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models,

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Jun- yang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference ac- celeration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024. 1, 2, 6, 7

work page arXiv 2024
[10]

Pali-x: On scaling up a multilingual vision and language model

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Se- bastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565, 2023. 3

work page arXiv 2023
[11]

Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2(3):6,

work page 2023
[12]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: To- wards general-purpose vision-language models with instruc- tion tuning. ArXiv, abs/2305.06500, 2023. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022. 4, 6

work page 2022
[14]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Model tells you what to discard: Adaptive kv cache compression for llms

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023. 2

work page arXiv 2023
[16]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. Gemini: a family of highly capable multi- modal models. arXiv preprint arXiv:2312.11805, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 6

work page 2017
[18]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,

work page
[19]

Lm-infinite: Simple on-the-fly length generalization for large language models

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023. 2

work page arXiv 2023
[20]

mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. arXiv preprint arXiv:2403.12895,

work page arXiv
[21]

Spvit: Enabling faster vision transformers via soft token pruning, 2022

Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, and Yanzhi Wang. Spvit: Enabling faster vision transformers via soft token pruning, 2022. 2

work page 2022
[22]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Otterhd: A high-resolution multi- modality model

Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high-resolution multi- modality model. arXiv preprint arXiv:2311.04219, 2023. 1

work page arXiv 2023
[24]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In In- ternational conference on machine learning , pages 19730– 19742. PMLR, 2023. 1, 7

work page 2023
[26]

Not all patches are what you need: Expediting vision transformers via token reorganiza- tions, 2022

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganiza- tions, 2022. 2

work page 2022
[27]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Open-llava-next: An open- source implementation of llava-next series for facilitating the large multi-modal model community

Chen Lin and Xing Long. Open-llava-next: An open- source implementation of llava-next series for facilitating the large multi-modal model community. https://github. com/xiaoachen98/Open-LLaVA-NeXT, 2024. 6

work page 2024
[29]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 7 10

work page 2024
[30]

Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 2, 5, 6

work page 2024
[31]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 1, 3, 5

work page 2024
[32]

World model on million-length video and language with blockwise ringattention

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention. arXiv preprint arXiv:2402.08268 ,

work page arXiv
[33]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshu- mali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems , 36, 2024. 2

work page 2024
[35]

Rar: Retrieving and ranking augmented mllms for visual recogni- tion

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xi- aoyi Dong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Rar: Retrieving and ranking augmented mllms for visual recogni- tion. arXiv preprint arXiv:2403.13805, 2024. 1

work page arXiv 2024
[36]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35:2507–2521,

work page
[37]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022. 6

work page arXiv 2022
[39]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceed- ings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 2, 6

work page 2021
[40]

Infographicvqa

Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision , pages 1697–1706, 2022. 2, 6

work page 2022
[41]

Gpt-4v(ision) system card, 2024

OpenAI. Gpt-4v(ision) system card, 2024. 1

work page 2024
[42]

arXiv preprint arXiv:2403.15388 , year=

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388,

work page arXiv
[43]

Crossget: Cross-guided ensem- ble of tokens for accelerating vision-language transformers,

Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, and Jiaqi Wang. Crossget: Cross-guided ensem- ble of tokens for accelerating vision-language transformers,

work page
[44]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 3, 6

work page 2019
[45]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Pyra: Parallel yielding re-activation for training-inference efficient task adaptation, 2024

Yizhe Xiong, Hui Chen, Tianxiang Hao, Zijia Lin, Jun- gong Han, Yuesong Zhang, Guoxin Wang, Yongjun Bao, and Guiguang Ding. Pyra: Parallel yielding re-activation for training-inference efficient task adaptation, 2024. 2

work page 2024
[49]

Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images

Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. arXiv preprint arXiv:2403.11703, 2024. 1

work page arXiv 2024
[50]

DeCo : Decoupling token compression from semantic abstraction in multimodal large language models

Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, and Lu Hou. Deco: Decoupling token compres- sion from semantic abstraction in multimodal large language models. arXiv preprint arXiv:2405.20985, 2024. 1

work page arXiv 2024
[51]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding. ArXiv, abs/2306.02858, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual in- put and output

Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, et al. Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual in- put and output. arXiv preprint arXiv:2407.03320, 2024. 1, 3

work page arXiv 2024
[54]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 7

work page internal anchor Pith review arXiv 2024
[55]

H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els. Advances in Neural Information Processing Systems, 36,

work page
[57]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 3 A. Appendix B. Ablation Study about Stage S In this section, we primarily discuss the ablation study of stages S. In these experiments, we set λ to 0.5, cons...

work page internal anchor Pith review Pith/arXiv arXiv 2023