How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

Daling Wang; Junzhao Huang; Shi Feng; Xiaocui Yang; Yifei Zhang; Yijie Huang; Yiqun Zhang; Yongkang Liu; Zhuoyue Jia; Zihan Wang

arxiv: 2605.16359 · v1 · pith:BCC2U5KAnew · submitted 2026-05-09 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 22:46 UTC · model grok-4.3

keywords: visual token pruning · multimodal language models · training-free pruning · evidence search · token budget allocation · vision-language models · inference efficiency · task-conditioned routing

The pith

Viewing visual token pruning as task-conditioned evidence search enables a training-free router to allocate fewer tokens effectively in multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current one-shot proxies for pruning visual tokens fall short when compression becomes aggressive or models scale up. It argues instead for treating pruning as searching for evidence relevant to the given task or question. A reader would care because feeding ever-longer image token sequences drives up inference costs in vision-language models. F^3A demonstrates this view by building question-conditioned cues, matching them with frozen sensing heads, and allocating a fixed token budget in four steps. The approach leaves the original prompting and decoding pipeline unchanged and requires no training or extra language-model passes.

Core claim

We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training, no extra LLM forward pass and preserves the original multimodal prompting and decoding pipeline.

What carries the argument

F^3A, a training-free router that performs task-conditioned evidence search by matching question-conditioned cues to visual-grid tokens via frozen sparse sensing heads followed by a four-step allocation process.
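
The four allocation steps named above can be pictured with a toy sketch. The Python below is not the authors' implementation: the function name `allocate_budget`, the block size, the crowding penalty `beta`, the quarter-budget seeding and recovery quotas, and the 4-neighbour rule are all assumptions chosen for illustration, and the grid of relevance scores stands in for whatever the frozen sensing heads would produce.

```python
def allocate_budget(scores, budget, block=2, beta=0.5):
    """Toy four-step allocation over a 2-D grid of relevance scores.

    Returns a sorted list of (row, col) positions of length min(budget, grid size).
    """
    h, w = len(scores), len(scores[0])
    budget = min(budget, h * w)
    selected = []

    def score(p):
        return scores[p[0]][p[1]]

    def add(p):
        # Respect the fixed budget and never pick a token twice.
        if p not in selected and len(selected) < budget:
            selected.append(p)

    def neighbours(p):
        r, c = p
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(y, x) for y, x in cand if 0 <= y < h and 0 <= x < w]

    cells = [(r, c) for r in range(h) for c in range(w)]
    blocks = [
        [(r, c) for r in range(r0, min(r0 + block, h))
                for c in range(c0, min(c0 + block, w))]
        for r0 in range(0, h, block) for c0 in range(0, w, block)
    ]

    # Step 1, coarse evidence localization: seed the strongest blocks (by mean score).
    ranked = sorted(blocks, key=lambda b: sum(map(score, b)) / len(b), reverse=True)
    for b in ranked[: max(1, budget // 4)]:
        add(max(b, key=score))

    # Step 2, local refinement: extend each seed with its best free neighbour.
    for seed in list(selected):
        free = [p for p in neighbours(seed) if p not in selected]
        if free:
            add(max(free, key=score))

    # Step 3, coverage-preserving competition: greedy picks with a crowding
    # penalty, holding back a quarter of the budget for the recovery step.
    while len(selected) < budget - budget // 4:
        free = [p for p in cells if p not in selected]
        add(max(free, key=lambda p: score(p)
                - beta * sum(n in selected for n in neighbours(p))))

    # Step 4, recovery: give blocks that still have no token their best one.
    uncovered = [b for b in blocks if not any(p in selected for p in b)]
    for b in sorted(uncovered, key=lambda b: max(map(score, b)), reverse=True):
        add(max(b, key=score))

    # Spend any leftover budget on the highest remaining scores.
    for p in sorted((p for p in cells if p not in selected), key=score, reverse=True):
        add(p)
    return sorted(selected)


# Made-up scores: a strong top-left cluster and a weaker isolated peak.
grid = [
    [0.90, 0.80, 0.10, 0.00],
    [0.75, 0.60, 0.00, 0.15],
    [0.00, 0.10, 0.20, 0.10],
    [0.10, 0.00, 0.10, 0.50],
]
picked = allocate_budget(grid, budget=6)
```

On this made-up grid, plain top-6 ranking would spend four slots on the top-left cluster. The sketch keeps three of those tokens and uses the freed slot on an otherwise uncovered block, which is the kind of trade-off the coverage and recovery steps are meant to make.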

If this is right

Higher task accuracy is retained compared with one-shot proxy methods when the visual token budget is severely restricted.
The same allocation process scales to larger multimodal models without any retraining or architectural changes.
Inference cost drops because fewer visual tokens are passed to the language backbone while the original prompting format stays intact.
The four-step process can be inserted before any existing vision-language pipeline without modifying decoding behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The evidence-search framing could be tested on pruning strategies for other sequential inputs such as audio frames or long text contexts.
Dynamic budget adjustment based on query complexity might emerge if the cue-matching step is exposed to the serving system.
Models could be pretrained with lightweight sparse heads from the start so that similar pruning becomes a native capability rather than a post-hoc router.

Load-bearing premise

Lightweight question-conditioned cues can be reliably matched to visual-grid tokens through frozen sparse sensing heads to guide effective pruning and budget allocation without any model training or additional LLM computation.

What would settle it

An experiment that applies F^3A and standard one-shot pruning baselines to a visual question answering benchmark at an aggressive budget such as 10 percent of original tokens and finds that F^3A yields lower accuracy.
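
A minimal sketch of how that test could be scored, assuming per-example correctness flags are already available for both methods on the same benchmark items. The helper names and the placeholder flags are hypothetical and do not come from the paper; the numbers are not results.

```python
import math


def tokens_at_budget(num_visual_tokens, ratio=0.10):
    """Visual tokens kept at a given retention ratio (never fewer than one)."""
    return max(1, math.floor(num_visual_tokens * ratio))


def accuracy(correct_flags):
    """Fraction of examples answered correctly (flags are 0/1 or booleans)."""
    return sum(correct_flags) / len(correct_flags)


def claim_survives(f3a_correct, baseline_correct):
    """False when F^3A is less accurate than the baseline at the same budget."""
    return accuracy(f3a_correct) >= accuracy(baseline_correct)


# Placeholder flags for illustration only.
f3a_flags = [1, 1, 0, 1]
baseline_flags = [1, 0, 0, 1]
verdict = claim_survives(f3a_flags, baseline_flags)
```

Both methods would need to be run at the same token count, for example `tokens_at_budget(n)` for an image that yields `n` visual tokens, so that any accuracy gap reflects the selection rule and not the budget.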

Figures

Figures reproduced from arXiv: 2605.16359 by Daling Wang, Junzhao Huang, Shi Feng, Xiaocui Yang, Yifei Zhang, Yijie Huang, Yiqun Zhang, Yongkang Liu, Zhuoyue Jia, Zihan Wang.

**Figure 2.** Overview of F^3A. Prompt-conditioned cues guide a three-stage foraging process: coarse search, visual lock-on, and rescue jump. The selected visual tokens replace the full visual block before frozen LLM prefill, without finetuning or decoding changes.

**Figure 3.** Compression-aware scaling on Qwen3-VL. (a) Average accuracy of full-token inference …

**Figure 4.** On Qwen3-VL family, each bar is the minimum visual token retention required to preserve …

**Figure 5.** Normalized ablation scores on Qwen3-VL-8B over HallusionBench, RealWorldQA, and …

Original abstract

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training, no extra LLM forward pass and preserves the original multimodal prompting and decoding pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

F^3A reframes visual token pruning as task-conditioned evidence search with a four-step training-free router, but empirical support is still needed.

The punchline is that F^3A proposes a training-free, multi-step router for pruning visual tokens by matching question cues to image tokens and then allocating budget through localization, refinement, competition, and recovery. This approach stands out because it moves away from single proxy methods toward a more structured search for relevant evidence. The use of frozen sparse sensing heads to handle the matching without training or additional model runs is a practical choice that preserves the original pipeline.

The paper does well in identifying the scaling problem with long visual sequences and in outlining a clear sequence of steps that aim to balance coverage and task relevance under fixed budgets.

Where it is soft is in the supporting evidence. The description does not include any experimental results, ablations, or direct comparisons, so it is difficult to assess how much better this performs than existing pruning techniques. The key assumption about reliable matching through the frozen heads lacks backing data at this point, which could be a load-bearing issue if the matches turn out noisy.

This work is aimed at researchers and engineers dealing with the inference costs of multimodal language models. Anyone looking into visual token reduction for larger vision-language systems could find the conceptual shift and the specific F^3A steps worth exploring or adapting. It has enough substance in the problem framing and method design to merit a serious referee. The full paper likely contains the necessary experiments to test the claims. I would recommend putting it through peer review to get a proper evaluation of the results and implementation.

Referee Report

2 major / 2 minor

Summary. The paper argues that visual token pruning in multimodal LLMs is better framed as task-conditioned evidence search rather than one-shot proxies such as attention or diversity. It proposes F^3A, a training-free router that constructs lightweight question-conditioned cues, matches them to visual-grid tokens via frozen sparse sensing heads, and allocates a fixed token budget through a four-step process (coarse localization, local refinement, coverage-preserving competition, and recovery of under-covered regions). The method claims to require no model training, no extra LLM forward pass, and to preserve the original multimodal prompting/decoding pipeline.

Significance. If the core matching and allocation steps prove reliable, F^3A could provide a practical, training-free way to reduce inference cost under aggressive visual-token budgets and across model scales. The explicit separation of cue construction from the LLM forward pass and the use of frozen components are clear strengths that would distinguish the work from methods requiring fine-tuning or additional passes.

major comments (2)

[Method description (F^3A router and cue-to-token matching)] The load-bearing step is the claim that question-conditioned cues can be reliably matched to task-relevant visual tokens by frozen sparse sensing heads without training or an extra LLM pass. The manuscript supplies no localization metrics (precision, recall, or IoU against ground-truth evidence regions) or comparisons against simple baselines for this matching operation; if the matches are noisy, the subsequent four-step allocation cannot recover the asserted advantage under tight budgets.
[Abstract and experimental validation] The abstract states that the approach needs 'no model training, no extra LLM forward pass' and preserves the original pipeline, yet provides no empirical results, ablation studies, or quantitative comparisons to existing training-free pruning methods. Without these data the central scaling claim remains unsupported.
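
The localization metrics requested in the first major comment are standard set-overlap measures. A minimal sketch follows, assuming the selected tokens and the ground-truth evidence region are both given as collections of token indices; the function name and the example indices are hypothetical.

```python
def localization_metrics(selected, evidence):
    """Precision, recall, and IoU of selected token indices against evidence indices."""
    selected, evidence = set(selected), set(evidence)
    hit = len(selected & evidence)      # selected tokens that lie inside the evidence region
    union = len(selected | evidence)
    return {
        "precision": hit / len(selected) if selected else 0.0,
        "recall": hit / len(evidence) if evidence else 0.0,
        "iou": hit / union if union else 0.0,
    }


# Illustrative indices only: 4 tokens kept, 3 tokens in the annotated region.
metrics = localization_metrics(selected=[1, 2, 3, 4], evidence=[3, 4, 5])
```

Under a tight budget, recall is the number to watch: evidence tokens dropped at the matching stage cannot be recovered by later allocation steps.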

minor comments (2)

[Introduction / Method] Define the acronym F^3A explicitly on first use and clarify whether the sparse sensing heads are obtained from a separate pre-training stage or derived directly from the target VLM.
[Figures] Ensure any diagrams of the four-step allocation process include explicit token-budget numbers and before/after token counts for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and indicate the revisions we will make to improve the manuscript.

Referee: [Method description (F^3A router and cue-to-token matching)] The load-bearing step is the claim that question-conditioned cues can be reliably matched to task-relevant visual tokens by frozen sparse sensing heads without training or an extra LLM pass. The manuscript supplies no localization metrics (precision, recall, or IoU against ground-truth evidence regions) or comparisons against simple baselines for this matching operation; if the matches are noisy, the subsequent four-step allocation cannot recover the asserted advantage under tight budgets.

Authors: We agree that direct localization metrics would strengthen the presentation of the cue-to-token matching step. Because F^3A is strictly training-free and no ground-truth evidence-region annotations exist for standard VQA benchmarks, we instead validate the overall pipeline via end-task accuracy under varying token budgets. To address the concern directly, we will add a dedicated subsection with qualitative token-selection visualizations and quantitative comparisons against simple baselines (random selection and attention-based pruning) for the matching operation. revision: yes
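
The two simple baselines named in this response, random selection and attention-based pruning, could look roughly like the sketch below. It is an illustration under stated assumptions, not code from the paper: `attention_mass` is a hypothetical list holding one precomputed attention score per visual token, and both function names are invented here.

```python
import random


def random_selection(num_tokens, budget, seed=0):
    """Keep `budget` token indices chosen uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_tokens), min(budget, num_tokens)))


def top_k_by_attention(attention_mass, budget):
    """Keep the `budget` tokens with the highest precomputed attention score."""
    order = sorted(range(len(attention_mass)), key=lambda i: attention_mass[i], reverse=True)
    return sorted(order[:budget])


# Illustrative scores only.
kept_random = random_selection(num_tokens=100, budget=10)
kept_attention = top_k_by_attention([0.1, 0.9, 0.3, 0.8], budget=2)
```

Running both at the same budget as the router gives a floor (random) and a one-shot proxy (attention) against which the matching step can be compared.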
Referee: [Abstract and experimental validation] The abstract states that the approach needs 'no model training, no extra LLM forward pass' and preserves the original pipeline, yet provides no empirical results, ablation studies, or quantitative comparisons to existing training-free pruning methods. Without these data the central scaling claim remains unsupported.

Authors: The abstract correctly summarizes the training-free and pipeline-preserving properties of F^3A. While the current manuscript contains initial scaling experiments across model sizes and token budgets, we acknowledge that the experimental section would benefit from expanded ablations and head-to-head comparisons with prior training-free methods. We will revise the experimental section accordingly and, if space permits, adjust the abstract to reference the new results. revision: yes

Circularity Check

0 steps flagged

No circularity: F^3A is a procedural router with no self-referential derivations or fitted predictions

The paper presents F^3A as a training-free, pre-LLM procedural algorithm consisting of cue construction, matching via frozen heads, and a four-step allocation (coarse localization, refinement, competition, recovery). No equations, parameter fits, or derivations appear in the abstract or described method. The central claim (pruning as task-conditioned evidence search) is implemented directly by the stated steps without reducing to self-definition, renamed empirical patterns, or load-bearing self-citations. The method is independent of target performance metrics and does not invoke uniqueness theorems or ansatzes from prior author work. This is the normal case of a self-contained algorithmic proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that frozen sparse sensing heads can perform effective question-to-token matching without retraining or extra forward passes; no free parameters or invented physical entities are mentioned.

axioms (1)

Domain assumption: Frozen sparse sensing heads can produce reliable matches between question-conditioned cues and visual-grid tokens for pruning decisions.
This assumption underpins the entire training-free router and is invoked when describing the matching step before allocation.

invented entities (1)

F^3A router (no independent evidence)
Purpose: Training-free allocation of fixed visual token budget via multi-stage evidence search.
New procedural component introduced to replace one-shot pruning proxies.

pith-pipeline@v0.9.0 · 5742 in / 1293 out tokens · 60215 ms · 2026-05-20T22:46:57.592306+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

We argue that visual token pruning is better viewed as task-conditioned evidence search... F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

F^3A borrows this search order from fruit-fly optimization algorithms... odor field $a_i$... sparse sensing heads... lock-on score $m_i = a_i + \lambda \ell_i - \beta r_i$
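
The lock-on score in the excerpt above is a per-token linear combination, which a few lines of Python can make concrete. The excerpt names only the odor field a_i; it does not define ℓ_i, r_i, λ, or β. The sketch therefore treats ℓ_i as some per-token bonus and r_i as some per-token penalty purely as an assumption, and the weights and input values are made up.

```python
def lock_on_scores(a, l, r, lam=0.5, beta=0.5):
    """Compute m_i = a_i + lam * l_i - beta * r_i for every token i.

    a: odor-field values (named in the excerpt).
    l, r: bonus and penalty terms; their meaning is an assumption here.
    """
    return [ai + lam * li - beta * ri for ai, li, ri in zip(a, l, r)]


# Illustrative values only: token 1 has a weaker odor field than token 0
# but a larger bonus and no penalty.
scores = lock_on_scores(a=[1.0, 0.5], l=[0.5, 1.0], r=[1.0, 0.0])
```

With these made-up inputs the second token ends up with the higher lock-on score despite its lower a_i, which shows how the λ and β terms can reorder tokens relative to the raw odor field.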

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
