Differentiable Efficient Operator Search

Chang Xu; Cho-Jui Hsieh; Jiyuan Zhang; Tao Huang; Weiguo Feng; Xiaohuan Pei; Yuanfan Guo

arxiv: 2606.05232 · v1 · pith:5JPB6QS2new · submitted 2026-06-03 · 💻 cs.LG · cs.AI

Xiaohuan Pei, Jiyuan Zhang, Yuanfan Guo, Weiguo Feng, Tao Huang, Cho-Jui Hsieh, Chang Xu

Pith reviewed 2026-06-28 07:28 UTC · model grok-4.3

keywords: efficient multimodal models, token reduction, differentiable search, operator search, visual token pruning, multimodal foundation models, neural architecture search

The pith

Manually designed token-reduction operators are special cases of a single differentiable search space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that operators like pruning, merging, pooling, and adaptive reweighting can be viewed as different regimes in one shared operator space. It proposes a differentiable framework to search for where to reduce tokens, how many to keep, and how to process them, all under efficiency constraints. This recovers existing designs and discovers new hybrid operators that offer good accuracy-efficiency balances on multimodal tasks. The shift from manual design to search matters because it automates finding efficient inference strategies for large models.

Core claim

Token-reduction operators in efficient multimodal foundation models can be interpreted as distinct regimes of a shared operator space, so a differentiable search over layer activation, retention budget, and operator behavior can optimize performance under budget constraints, recover hand-designed baselines, and discover hybrid operators with competitive trade-offs.

What carries the argument

The parameterization of a shared operator space for joint differentiable optimization of token reduction location, count, and processing method.
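To make the idea of a shared operator space concrete, here is a minimal Python sketch. It is not the paper's actual parameterization. It assumes a hypothetical operator `unified_reduce` with two knobs, a gate `gamma` and a temperature `tau`. The names echo the symbols γ and τ that appear in the figure captions, but the formula, the dot-product similarity, and the top-r anchor selection are all illustrative assumptions.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length float lists."""
    return sum(a * b for a, b in zip(u, v))


def unified_reduce(tokens, scores, r, gamma, tau):
    """Reduce `tokens` to `r` tokens (hypothetical operator, 1 <= r <= len(tokens)).

    tokens: list of equal-length float lists; scores: one importance score each.
    gamma in [0, 1]: 0 leaves the kept anchors untouched, 1 fully absorbs the rest.
    tau > 0: small values send each dropped token to its most similar anchor,
             large values spread dropped tokens evenly over all anchors.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    anchor_idx, rest_idx = sorted(order[:r]), order[r:]
    anchors = [list(tokens[i]) for i in anchor_idx]
    dim = len(anchors[0])
    sums = [[0.0] * dim for _ in range(r)]  # weighted sum absorbed per anchor
    mass = [0.0] * r                        # total weight absorbed per anchor
    for j in rest_idx:
        sims = [dot(tokens[j], a) / tau for a in anchors]
        top = max(sims)
        weights = [math.exp(s - top) for s in sims]  # numerically stable softmax
        total = sum(weights)
        for k in range(r):
            w = weights[k] / total
            mass[k] += w
            for d in range(dim):
                sums[k][d] += w * tokens[j][d]
    reduced = []
    for k in range(r):
        denom = 1.0 + gamma * mass[k]  # anchor counts once, absorbed mass scaled by gamma
        reduced.append([(anchors[k][d] + gamma * sums[k][d]) / denom for d in range(dim)])
    return reduced
```

Under these assumptions, `gamma = 0` returns the anchors unchanged (pruning), `gamma = 1` with a small `tau` averages each dropped token into its most similar anchor (merging), and `gamma = 1` with a large `tau` spreads the dropped tokens evenly over all anchors (pooling). Intermediate values give hybrid behavior, which is the kind of interior point a search could land on.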

If this is right

Hand-designed operators like pruning and pooling are recovered as special cases of the search.
Hybrid operators beyond manual designs can be found automatically.
The approach maintains competitive accuracy even with aggressive reduction of visual tokens.
Efficient multimodal inference can be achieved by optimizing the operator search rather than designing operators by hand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar unification might apply to other model efficiency techniques if they can be cast as parameter regimes.
The method could be extended to search over combinations with other efficiency methods like attention approximations.
Results on multimodal benchmarks suggest potential for application in other domains with high token counts, such as long-context language models.

Load-bearing premise

All relevant token reduction strategies can be represented as points within one continuous differentiable parameter space.

What would settle it

A direct comparison on multimodal benchmarks where the best manually designed operator consistently outperforms any searched operator at equivalent computational cost would falsify the utility of the shared space approach.

Figures

Figures reproduced from arXiv: 2606.05232 by Chang Xu, Cho-Jui Hsieh, Jiyuan Zhang, Tao Huang, Weiguo Feng, Xiaohuan Pei, Yuanfan Guo.

**Figure 1.** Overview of Efficient Operator Search. (a) EOS replaces manually designed reduction recipes with automatic operator search. (b) The searched hybrid operator lies inside the unified operator space and improves performance under the same token budget.

**Figure 2.** Overview of Efficient Operator Search. Given a frozen multimodal foundation model, EOS parameterizes token reduction at each decoder layer by three coupled components: layer activation g_l, retention budget c_l, and operator regime Ω_l = (γ_l, τ_l, θ_l, ρ_l, ν_l). At each active layer, important visual tokens are retained as anchors, while the remaining candidates are processed by a unified reduction operator …

**Figure 3.** Numerical verification of corner operators. Our unified operator reproduces PRUNE, MERGE, POOL, and REWEIGHT under their corresponding settings in Ω_l.

3 Experiments, 3.1 Experimental Setup. Model and baselines. We evaluate EOS on frozen LLaVA [12] and compare it with representative corner operators, including pruning-based SparseVLM-v1/v2 [32], merging-based ToMe [1], and pooling-based Pool. All methods use…

**Figure 4.** Effect of the alignment weight λ_a. Sweeping λ_a under fixed retained-token budgets shows that EOS is stable across λ_a ∈ [0.01, 0.5]. The CE-off variant removes the cross-entropy loss.

4 Ablation Studies. We analyze EOS from three complementary aspects: (i) the search policy, controlled by the hidden-state alignment weight λ_a in Eq. 14; (ii) the search space, including the active reducer layers R and the laye…

**Figure 5.** Effect of active reducer layers R. Subfigures (a)–(c) sweep one reducer layer while fixing the others, and subfigure (d) jointly sweeps (l_1, l_3).

[Plot residue: panel (a) sweeps γ_6 (0: prune, 1: merge/pool) against rate (%) for r = 192, 64, and 16, with the EOS value marked at 0.0834; a further panel sweeps a merge-to-pool parameter over the same budgets.]

**Figure 6.** Effect of the operator search space Ω_6 = (γ_6, τ_6, θ_6, ρ_6, ν_6). Each panel sweeps one parameter in the central reducer while fixing the remaining components at Θ⋆.

[Plot residue: panel (a) shows POPE accuracy (%) against retained visual tokens r ∈ {192, 128, 96, 64, 32, 16} for SparseVLM-v1, SparseVLM-v2, ToMe, Pool, and EOS; a further panel plots MME (Perc. + Cogn…) over the same budgets.]

**Figure 7.** Robustness across retained-token budgets. EOS reuses the same searched configuration Θ⋆ across different budgets. The margin over SparseVLM-v2 increases as r decreases, indicating that the searched operator regime is more robust than fixed corner operators under aggressive compression.

4.3 Effect of Operator Regime Ω_l. We study the operator component of the search space by sweeping γ_6, τ_6, θ_6, ρ_6, and ν_6, c…

**Figure 8.** Operator-regime profile across decoder layers. (a) The searched gate γ_l is close to zero at the outer reducers and rises only at the central reducer, locating the interior HYBRID at L6. (b) The trajectory of γ_6 during search initializes uniformly at 0.5, briefly explores the merge corner, and converges to 0.08, a regime that no hand-designed corner can reach.

Original abstract

Efficient multimodal foundation models often rely on manually designed token-reduction operators, such as pruning, merging, pooling, and adaptive reweighting. Although these operators appear different, we show that they can be interpreted as distinct regimes of a shared operator space. Based on this view, we introduce Efficient Operator Search, a differentiable framework that jointly searches where to reduce tokens, how many tokens to retain, and how reduced token information should be processed. The proposed search space parameterizes layer activation, retention budget, and operator behavior, while the search policy optimizes task performance under one-sided budget and cost constraints. This formulation recovers representative hand-designed baselines as special cases and further discovers hybrid operators beyond isolated manual designs. Experiments on multimodal benchmarks show that the searched operators achieve competitive accuracy-efficiency trade-offs, especially under aggressive visual-token reduction. These results suggest that efficient multimodal inference can be reframed from manual operator design to differentiable operator search.
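One way to make "where to reduce" and "how many tokens to retain" searchable by gradient descent is to relax both decisions into continuous values. The sketch below is an assumption for illustration, not the paper's formulation: `gate_logits` and `keep_logits` are hypothetical per-layer parameters squashed by a sigmoid, and the expected visual-token count is propagated layer by layer. The starting count of 576 in the usage lines is an arbitrary example value.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def expected_token_counts(n_visual, gate_logits, keep_logits):
    """Expected visual-token count leaving each layer under relaxed decisions.

    gate_logits[l] -> relaxed layer activation g in (0, 1) (hypothetical).
    keep_logits[l] -> relaxed keep ratio c in (0, 1) (hypothetical).
    """
    counts = []
    n = float(n_visual)
    for a, b in zip(gate_logits, keep_logits):
        g = sigmoid(a)
        c = sigmoid(b)
        # An inactive layer (g near 0) passes tokens through unchanged;
        # an active layer (g near 1) keeps a fraction c of them.
        n = n * ((1.0 - g) + g * c)
        counts.append(n)
    return counts


# Usage: the first layer is effectively off, the second keeps about half.
counts = expected_token_counts(576, [-50.0, 50.0], [0.0, 0.0])
```

Because every step is a smooth function of the logits, a cost computed from `counts` could be differentiated with respect to both the layer gates and the keep ratios, which is the property a joint search needs.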

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames token reduction as a joint differentiable search over location, budget, and operator type by treating manual designs as regimes in one space, but the abstract gives no equations or limits to confirm the special-case recoveries.

The main takeaway is that they treat pruning, merging, pooling, and reweighting as points inside a single parameterized space and then run a differentiable search over layer activation, retention budget, and operator behavior under budget constraints. This recovers the hand-designed baselines as special cases and turns up hybrids that mix elements from several of them.

The work is clearest on the motivation: manual operator design is replaced by search, and the experiments on multimodal benchmarks reportedly hold up under aggressive visual token reduction. That framing is useful for anyone already running efficiency experiments on vision-language models.

The soft spot is exactly the one the stress-test note flags. The abstract asserts that the baselines emerge as special cases but supplies no parameterization, no explicit limits on the retention-budget variable, and no derivation showing that a particular setting reproduces token merging or adaptive reweighting exactly. Without those details it is impossible to tell whether the search genuinely covers the prior designs or simply explores a new space that happens to contain them. The soundness score in the report reflects this gap.

The paper is aimed at people who already work on neural architecture search or token-efficient multimodal inference. A reader who needs the concrete search space and the numbers would get something out of the full version; someone looking for a ready-to-use operator would still need the methods section.

I would send it to peer review. The claim is specific enough that referees can check the parameterization and the recovery property once the equations are on the page.

Referee Report

2 major / 1 minor

Summary. The paper claims that manually designed token-reduction operators (pruning, merging, pooling, adaptive reweighting) in efficient multimodal foundation models are distinct regimes of a shared operator space. It introduces Efficient Operator Search (EOS), a differentiable framework that jointly optimizes layer activation, retention budget, and operator behavior under one-sided budget and cost constraints. The approach recovers hand-designed baselines as special cases, discovers hybrid operators, and yields competitive accuracy-efficiency trade-offs on multimodal benchmarks, especially under aggressive visual-token reduction.

Significance. If the shared parameterization rigorously recovers the manual operators as special cases and the discovered hybrids improve upon them, the work would reframe efficient multimodal inference as an automated differentiable search problem rather than manual design. The joint optimization of location, budget, and behavior under constraints is a potentially useful formulation if the unification holds.

major comments (2)

[Abstract] The central unification claim that 'this formulation recovers representative hand-designed baselines as special cases' is load-bearing but unsupported by any equations, explicit limiting cases, or parameterization details (e.g., no demonstration that a particular retention-budget value exactly reproduces token merging or that the operator-behavior variable reproduces adaptive reweighting). Without this embedding, the search may explore a new space rather than extending a common one.
[Abstract] The experimental claim that 'searched operators achieve competitive accuracy-efficiency trade-offs' is stated without any quantitative numbers, specific benchmarks, baselines, or ablation results, preventing assessment of whether the gains are meaningful or whether the search actually improves upon the recovered baselines.

minor comments (1)

[Abstract] The phrase 'one-sided budget and cost constraints' is used without definition or clarification of what 'one-sided' denotes in the optimization policy.
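A common reading of 'one-sided', offered here only as an assumption since the abstract does not define it, is a penalty that activates when cost exceeds the budget and vanishes otherwise. The sketch below uses a squared hinge; `one_sided_penalty`, `search_objective`, and `weight` are hypothetical names, not taken from the paper.

```python
def one_sided_penalty(cost, budget):
    """Squared hinge: zero at or under budget, quadratic in the excess above it."""
    excess = max(0.0, cost - budget)
    return excess * excess


def search_objective(task_loss, cost, budget, weight=1.0):
    """Task loss plus a weighted one-sided budget penalty (illustrative only)."""
    return task_loss + weight * one_sided_penalty(cost, budget)


# Usage: staying under a budget of 192 tokens costs nothing extra,
# while exceeding it by 8 tokens adds weight * 64 to the objective.
under = search_objective(0.5, 100.0, 192.0, weight=0.1)
over = search_objective(0.5, 200.0, 192.0, weight=0.1)
```

Under this reading the search is never rewarded for going below the budget, only discouraged from going above it, which would be the contrast with a two-sided penalty that pulls cost toward the budget from both directions.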

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree the abstract should more explicitly support its central claims and will revise accordingly. Point-by-point responses follow.

Referee: [Abstract] The central unification claim that 'this formulation recovers representative hand-designed baselines as special cases' is load-bearing but unsupported by any equations, explicit limiting cases, or parameterization details (e.g., no demonstration that a particular retention-budget value exactly reproduces token merging or that the operator-behavior variable reproduces adaptive reweighting). Without this embedding, the search may explore a new space rather than extending a common one.

Authors: The method section of the manuscript defines the shared operator space via continuous parameters for layer activation, retention budget, and operator behavior (including the reweighting function). Specific limiting values recover the baselines exactly (e.g., budget=1 with identity reweighting for no reduction; budget approaching 0 with merging-style aggregation). We will add a concise statement of these limiting cases and a pointer to the equations in the revised abstract. revision: yes
Referee: [Abstract] The experimental claim that 'searched operators achieve competitive accuracy-efficiency trade-offs' is stated without any quantitative numbers, specific benchmarks, baselines, or ablation results, preventing assessment of whether the gains are meaningful or whether the search actually improves upon the recovered baselines.

Authors: The abstract is a high-level summary; quantitative results (accuracy/FLOPs on VQAv2, GQA, MM-Vet; comparisons to manual baselines and ablations) appear in the experiments and ablation sections (Sections 3–4). We will incorporate the key numerical trade-offs into the abstract to make the experimental claim self-contained. revision: yes

Circularity Check

0 steps flagged

No circularity: framework explicitly constructs shared space to include baselines as special cases; recovery is definitional design, not hidden reduction.

full rationale

The abstract states the parameterization is built to recover hand-designed operators as special cases, which is an explicit modeling choice rather than a derivation that reduces to fitted inputs or self-citations. No equations are shown that would make performance predictions equivalent to the search inputs by construction. No self-citation chains or uniqueness theorems from prior author work are invoked as load-bearing. The central claim rests on the differentiability of the search and empirical results, which are independent of the unification premise. This matches the default expectation of a self-contained framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the search space and constraints are described at a high level only.

pith-pipeline@v0.9.1-grok · 5693 in / 1062 out tokens · 15287 ms · 2026-06-28T07:28:26.650185+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 17 canonical work pages · 12 internal anchors

[1] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461, 2022.

[2] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024.

[3] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024.

[4] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

[5] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.

[6] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[7] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European conference on computer vision, pages 235–251. Springer, 2016.

[8] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.

[9] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.

[10] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.

[11] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023.

[12] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

[13] Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Siteng Huang, and Honggang Chen. Global compression commander: Plug-and-play inference acceleration for high-resolution large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 7350–7358, 2026.

[14] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024.

[15] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: on the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.

[16] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022.

[17] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022.

[18] Xiaohuan Pei, Yanxi Li, Minjing Dong, and Chang Xu. Neural architecture retrieval. arXiv preprint arXiv:2307.07919, 2023.

[19] Xiaohuan Pei, Tao Huang, and Chang Xu. Cross-self KV cache pruning for efficient vision-language inference. arXiv preprint arXiv:2412.04652, 2024.

[20] Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, and Chang Xu. Action-aware dynamic pruning for efficient vision-language-action manipulation. arXiv preprint arXiv:2509.22093, 2025.

[21] Xiaohuan Pei, Tao Huang, YanXiang Ma, and Chang Xu. Rethinking causal mask attention for vision-language inference. arXiv preprint arXiv:2505.18605, 2025.

[22] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.

[23] Yuheng Shi, Xiaohuan Pei, Minjing Dong, and Chang Xu. Catching the details: Self-distilled ROI predictors for fine-grained MLLM perception. arXiv preprint arXiv:2509.16944, 2025.

[24] Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, and Chang Xu. Q-Zoom: Query-aware adaptive perception for efficient multimodal large language models. arXiv preprint arXiv:2604.06912, 2026.

[25] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.

[26] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[27] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[29] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[30] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025.

[31] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

[32] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024.

[1] [1]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

2024

[3] [3]

Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision- language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

2024

[4] [4]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

2019

[6] [6]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InEuropean conference on computer vision, pages 235–251. Springer, 2016

2016

[8] [8]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed- bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

2023

[11] [11]

Evaluating object hallucination in large vision-language models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 292–305, 2023

2023

[12] [12]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

2023

[13] [13]

Global compression commander: Plug-and-play inference acceleration for high-resolution large vision-language models

Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Siteng Huang, and Honggang Chen. Global compression commander: Plug-and-play inference acceleration for high-resolution large vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 7350–7358, 2026

2026

[14] [14]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

2024

[15] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.

[16] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems, 35:2507–2521, 2022.

[17] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022.

[18] Xiaohuan Pei, Yanxi Li, Minjing Dong, and Chang Xu. Neural architecture retrieval. arXiv preprint arXiv:2307.07919, 2023.

[19] Xiaohuan Pei, Tao Huang, and Chang Xu. Cross-self kv cache pruning for efficient vision-language inference. arXiv preprint arXiv:2412.04652, 2024.

[20] Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, and Chang Xu. Action-aware dynamic pruning for efficient vision-language-action manipulation. arXiv preprint arXiv:2509.22093, 2025.

[21] Xiaohuan Pei, Tao Huang, YanXiang Ma, and Chang Xu. Rethinking causal mask attention for vision-language inference. arXiv preprint arXiv:2505.18605, 2025.

[22] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.

[23] Yuheng Shi, Xiaohuan Pei, Minjing Dong, and Chang Xu. Catching the details: Self-distilled roi predictors for fine-grained mllm perception. arXiv preprint arXiv:2509.16944, 2025.

[24] Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, and Chang Xu. Q-zoom: Query-aware adaptive perception for efficient multimodal large language models. arXiv preprint arXiv:2604.06912, 2026.

[25] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.

[26] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[27] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[29] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[30] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19792–19802, 2025.

[31] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

[32] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024.

A Technical appendices and supplementary material

This appendix expands on four aspect...