A More Word-like Image Tokenization for MLLMs

Hyemin Jeong; Hyungwook Choi; Hyun Lee; Hyunsoo Cho; Joonseok Lee; Soo Kyung Kim; Yejin Kim

arXiv: 2605.17954 · v1 · pith:HDKS4VNRnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.LG


Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

Pith reviewed 2026-05-20 12:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: visual tokenization, multimodal large language models, image clustering, efficient inference, semantic tokens, token budget


The pith

Clustering image patches into semantic units produces fewer word-like visual tokens for multimodal models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern MLLMs convert images into long streams of continuous embeddings that differ from the discrete word tokens LLMs expect. DiVT clusters patch embeddings from the vision encoder into groups that each represent a distinct visual concept rather than a fixed grid cell. The method dynamically adjusts the total number of tokens according to image complexity. On multiple multimodal benchmarks it matches or exceeds standard projectors while using substantially fewer tokens. This shortens sequence length and thereby lowers memory use and latency without any change to the vision encoder or language model.

Core claim

DiVT replaces fixed-grid visual tokenization with a clustering step that groups patch embeddings into coherent semantic units. Each output token therefore corresponds to one visual concept instead of one spatial location. The clustering also adapts the token budget to the input image so that simpler scenes receive fewer tokens and complex scenes receive more, supplying an explicit accuracy-compute trade-off.

What carries the argument

Disentangled Visual Tokenization (DiVT), a post-encoder clustering operation that converts a dense grid of patch embeddings into a shorter sequence of semantically distinct tokens.
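
The post-encoder clustering idea can be illustrated with a small dependency-free sketch. It assumes a greedy similarity-threshold rule followed by mean pooling, which are stand-ins chosen for brevity and not the paper's actual algorithm; the names `cluster_patches`, `pool_tokens`, and `theta` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def cluster_patches(patches, theta):
    """Greedy threshold clustering (illustrative assumption, not DiVT itself).

    A patch joins the most similar existing cluster when its cosine
    similarity to that cluster's seed patch is at least `theta`;
    otherwise it seeds a new cluster.
    Returns a list of clusters, each a list of patch indices.
    """
    seeds = []     # index of the seed patch of each cluster
    clusters = []  # member indices of each cluster
    for i, patch in enumerate(patches):
        best, best_sim = -1, theta
        for c, seed in enumerate(seeds):
            sim = cosine(patch, patches[seed])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best < 0:
            seeds.append(i)
            clusters.append([i])
        else:
            clusters[best].append(i)
    return clusters

def pool_tokens(patches, clusters):
    """Mean-pool the members of each cluster into a single visual token."""
    dim = len(patches[0])
    return [
        [sum(patches[i][d] for i in members) / len(members) for d in range(dim)]
        for members in clusters
    ]

# Toy input: two near-duplicate patches of one concept, two of another.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = cluster_patches(patches, theta=0.8)
tokens = pool_tokens(patches, clusters)
print(clusters)     # [[0, 1], [2, 3]]
print(len(tokens))  # 2
```

In this toy setup the threshold is the only knob: raising `theta` to 0.999 on the same input leaves every patch in its own cluster, so four patches yield four tokens instead of two.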

If this is right

Multimodal benchmarks are solved at equal or higher accuracy with markedly fewer visual tokens.
Memory footprint and inference latency drop in proportion to the reduction in token count.
Token budget can be scaled directly with scene complexity to trade accuracy for speed.
Visual inputs become more compatible with the discrete-token regime the language model was originally trained on.
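
A back-of-the-envelope calculation shows how the memory and compute consequences above scale with token count. It assumes a standard multi-head attention decoder with a full key/value cache and counts visual tokens only; the layer count, hidden size, and token counts are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_value=2):
    """Key/value cache size for a standard multi-head attention decoder.

    Each layer stores one key vector and one value vector of length
    `hidden_size` per token, hence the leading factor of 2.
    `bytes_per_value=2` corresponds to 16-bit floats.
    """
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value

def attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over `num_tokens`."""
    return num_tokens * num_tokens

# Hypothetical configuration, chosen only to make the arithmetic concrete.
LAYERS, HIDDEN = 32, 4096
dense_tokens, clustered_tokens = 576, 144

dense_mib = kv_cache_bytes(dense_tokens, LAYERS, HIDDEN) / 2**20
clustered_mib = kv_cache_bytes(clustered_tokens, LAYERS, HIDDEN) / 2**20
pair_ratio = attention_pairs(dense_tokens) / attention_pairs(clustered_tokens)

print(dense_mib, clustered_mib)  # 288.0 72.0
print(pair_ratio)                # 16.0
```

Under these assumptions the cache shrinks linearly with the token count (4x fewer tokens, 4x less memory), while the attention score computation among the visual tokens shrinks quadratically (16x fewer pairs). Text tokens and grouped-query attention variants would change the absolute numbers.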

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same clustering idea could be applied to video frames treated as extended patch sequences to control token growth over time.
Lower token counts might allow higher-resolution inputs to be processed without quadratic growth in compute.
Semantic clusters might serve as an interpretable intermediate representation for debugging what the model attends to.

Load-bearing premise

Grouping nearby patch embeddings into coherent semantic clusters produces tokens that the fixed language model can treat like discrete word units.

What would settle it

A controlled test on a benchmark of highly detailed images in which DiVT requires the same or greater number of tokens as a standard projector to reach equivalent accuracy, or in which the clustered tokens show no measurable increase in compatibility with the language model’s attention patterns.

Figures

Figures reproduced from arXiv: 2605.17954 by Hyemin Jeong, Hyungwook Choi, Hyun Lee, Hyunsoo Cho, Joonseok Lee, Soo Kyung Kim, Yejin Kim.

**Figure 1.** Comparison with existing projectors. Patch features (bottom layer) are mapped to visual tokens (top layer). Each color represents the principal semantic of the patch.

… pre-trained LLM for its reasoning ability and a pre-trained vision encoder (e.g. CLIP [35], SigLIP [49]) to map the pixel-level signals to a semantic latent space. Since these two pre-trained models operate on different latent spaces, a visua…

**Figure 2.** Patch similarity across ViT layers. Patch-wise cosine similarity increases in deeper layers, indicating that repeated self-attention homogenizes patch embeddings within an image.

… this threshold can also be adjusted at inference time, allowing practitioners to trade-off representational detail against memory and latency without retraining, and to match computational budgets in deployment. We evaluate our …

**Figure 3.** Overview of DiVT. The process consists of three main stages: (1) Initial patch clustering, which elects representative patch centroids based on feature diversity (Sec. 3.1); (2) Cluster refinement for semantically more coherent groups (Sec. 3.2); (3) Visual token formulation to aggregate information within each cluster to semantically disentangled visual tokens (Sec. 3.3).

… of 500 images from MMBench [32] …

**Figure 4.** Illustration of dynamic token clustering. An image with relatively simpler content (top) uses less number of clusters than one with a more complex scene (bottom). See Appendix E for more examples.

… supporting adaptive token lengths for each image. Unlike the linear projectors that ignore patch entanglement, our approach restructures visual information into semantic units, better compatible with the discret…

**Figure 5.** Qualitative demonstration. Attention maps highlight the regions in the image that the model attends to for specific object tokens. Our method produces attention clusters tightly focused on the object token, yielding more interpretable patterns, while the MLP projector exhibits more scattered attention over irrelevant regions. See Appendix D for more examples.

… Backbone # Tokens MMB VQAv2 GQA MME MM-Vet VQAT…
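
The quantity plotted in Figure 2, patch-wise cosine similarity within an image, can be approximated by averaging cosine similarity over all pairs of patch embeddings from one layer. The sketch below is a minimal version under that assumption; the function name and the toy vectors are hypothetical, and the paper's exact measurement protocol is not given in this text.

```python
import math
from itertools import combinations

def mean_pairwise_cosine(patches):
    """Average cosine similarity over all unordered pairs of patch vectors."""
    unit = []
    for p in patches:
        norm = math.sqrt(sum(x * x for x in p)) or 1.0  # guard zero vectors
        unit.append([x / norm for x in p])
    pairs = list(combinations(unit, 2))
    total = sum(sum(a * b for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)

# Toy stand-ins for one image's patch embeddings at two depths.
shallow = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # diverse directions
deep = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.2]]      # nearly aligned

shallow_sim = mean_pairwise_cosine(shallow)
deep_sim = mean_pairwise_cosine(deep)
print(shallow_sim < deep_sim)  # True
```

Applied per layer to real encoder outputs, a rising value with depth would correspond to the homogenization the caption describes.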

Original abstract

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Disentangled Visual Tokenization (DiVT) for MLLMs. It clusters patch embeddings from a fixed vision encoder into coherent semantic units so that each visual token corresponds to a distinct concept rather than a grid cell, while also adapting the token budget to image complexity. The central claim is that this produces visual inputs more compatible with a frozen LLM's discrete token regime, allowing DiVT to match or exceed baseline performance on multimodal benchmarks with substantially fewer tokens and without any changes to the vision encoder or language model.

Significance. If the central claim holds, the work would be significant for efficient multimodal modeling: it offers an explicit accuracy-compute trade-off and a concrete mechanism for reducing memory and latency while preserving or improving downstream performance. The public release of code is a clear strength that enables direct verification of the clustering procedure and adaptive budget.

major comments (2)

[Method and Experiments] The core assumption that clustering yields tokens the frozen LLM processes more like its original word tokens (rather than simply benefiting from shorter sequence length) is load-bearing for the entire contribution. No embedding-space statistics, attention-map comparisons, or ablation isolating semantic coherence versus length reduction are referenced in the provided description of the method or experiments; without such evidence the performance gains could be explained by the reduced token count alone.
[Abstract and §4] The abstract and method description state performance gains with fewer tokens but supply no quantitative details on the clustering algorithm, error bars, dataset splits, or ablation studies. This absence makes it impossible to assess whether post-hoc choices affect the reported robustness under limited token budgets.
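
The ablation the first major comment asks for needs a control that matches token count without using feature similarity. One simple candidate, sketched below as an assumption and not as anything the paper reports, is to mean-pool contiguous runs of patches in raster order; `uniform_pool` is a hypothetical name.

```python
def uniform_pool(patches, num_tokens):
    """Length-matched, non-semantic control.

    Splits the patch sequence (in raster order) into `num_tokens`
    contiguous chunks of near-equal size and mean-pools each chunk,
    ignoring what the patches contain.
    """
    n = len(patches)
    if not 1 <= num_tokens <= n:
        raise ValueError("num_tokens must be between 1 and len(patches)")
    dim = len(patches[0])
    tokens = []
    for k in range(num_tokens):
        start = k * n // num_tokens
        end = (k + 1) * n // num_tokens
        chunk = patches[start:end]
        tokens.append([sum(p[d] for p in chunk) / len(chunk) for d in range(dim)])
    return tokens

# Two concepts interleaved along the raster order.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
control_tokens = uniform_pool(patches, 2)
print(control_tokens)  # [[0.5, 0.5], [0.5, 0.5]]
```

Both control tokens blend the two concepts, whereas a similarity-based grouping of the same four patches could keep them apart at the same budget of two tokens. Comparing the two at equal token counts is what would separate the effect of semantic coherence from the effect of a shorter sequence.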

minor comments (2)

[Figures and Tables] Figure captions and tables should explicitly report the average and maximum token reduction percentages alongside the benchmark scores for direct comparison with baselines.
[Method] Notation for the adaptive token budget (e.g., how complexity is estimated and the exact threshold function) should be formalized with an equation to avoid ambiguity.
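
As an indication of what the requested formalization might look like, the block below writes a threshold-driven token budget in LaTeX. It is one possible notation built on the assumption of a cosine-similarity threshold $\theta$ and mean aggregation; the symbols $z_i$, $c_k$, $C_k$, $K$, and $t_k$ are introduced here for illustration, and the paper's actual complexity estimate and threshold function are not given in this text.

```latex
% Patch embeddings of one image: Z = \{ z_1, \dots, z_N \}.
% A clustering partitions Z into groups C_1, \dots, C_K with representatives c_k.
\begin{align}
  \cos(z_i, c_k) &\ge \theta
    \quad \text{for all } z_i \in C_k,\; k = 1, \dots, K, \\
  K(Z; \theta) &= \bigl|\{ C_1, \dots, C_K \}\bigr|, \\
  t_k &= \frac{1}{|C_k|} \sum_{z_i \in C_k} z_i .
\end{align}
```

Under this reading the token budget $K(Z; \theta)$ is not set directly: it falls out of the image content and the threshold, so a homogeneous image yields few clusters and a raised $\theta$ yields more.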

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and have revised the manuscript to incorporate additional evidence and details where appropriate.

Point-by-point responses

Referee: [Method and Experiments] The core assumption that clustering yields tokens the frozen LLM processes more like its original word tokens (rather than simply benefiting from shorter sequence length) is load-bearing for the entire contribution. No embedding-space statistics, attention-map comparisons, or ablation isolating semantic coherence versus length reduction are referenced in the provided description of the method or experiments; without such evidence the performance gains could be explained by the reduced token count alone.

Authors: We agree that direct evidence isolating semantic coherence from length reduction would strengthen the central claim. Our primary results already indicate that the benefit is not solely from shorter sequences, as DiVT with fewer tokens matches or exceeds the performance of standard full-token baselines (where simply reducing token count in fixed-grid methods typically degrades results). To address this explicitly, the revised manuscript now includes embedding-space similarity statistics between DiVT tokens and LLM word embeddings, attention-map comparisons demonstrating more concept-focused patterns, and an ablation contrasting semantic clustering against length-matched but non-semantic token reduction.
Revision: yes

Referee: [Abstract and §4] The abstract and method description state performance gains with fewer tokens but supply no quantitative details on the clustering algorithm, error bars, dataset splits, or ablation studies. This absence makes it impossible to assess whether post-hoc choices affect the reported robustness under limited token budgets.

Authors: We acknowledge that the abstract and initial method description lacked sufficient quantitative specifics. The revised manuscript expands these sections to report the clustering algorithm details (including the adaptive mechanism for determining cluster count based on image complexity and the specific hyperparameters used), includes error bars from multiple random seeds, clarifies the exact dataset splits for all benchmarks, and adds ablation studies examining sensitivity to clustering parameters and token budget choices to confirm robustness.
Revision: yes
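
One of the statistics named in the first response, similarity between visual tokens and LLM word embeddings, could be computed along the following lines. This is a hedged sketch only: `nearest_word_similarity` is a hypothetical helper, the vectors are toy values, and the text does not say which statistic the revised manuscript actually uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def nearest_word_similarity(tokens, vocab):
    """For each visual token, the highest cosine similarity to any word embedding."""
    return [max(cosine(t, w) for w in vocab) for t in tokens]

# Toy word-embedding table and two visual tokens in the same space.
vocab = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.9, 0.1], [0.5, 0.5]]

scores = nearest_word_similarity(visual_tokens, vocab)
mean_score = sum(scores) / len(scores)
print(scores[0] > scores[1])  # True
```

The first token lies close to a single word direction and scores higher than the second, which sits between two words. Averaging such scores over a dataset, for clustered tokens and for a baseline projector, would give one number per method to compare.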

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no self-referential reductions or fitted predictions presented as derivations.

Full rationale

The paper proposes DiVT as a clustering-based tokenization technique that adapts token count to image complexity, then reports empirical results on benchmarks. No equations, first-principles derivations, or mathematical claims are present in the provided text. The central claims rest on experimental comparisons rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed fitted parameters. The adaptive budget is described as an explicit design choice for accuracy-compute trade-off, not as a prediction derived from the same data used for evaluation. This is a standard self-contained empirical contribution without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit parameters, axioms, or invented entities; clustering and adaptation mechanisms are described at high level only.

pith-pipeline@v0.9.0 · 5758 in / 1090 out tokens · 31238 ms · 2026-05-20T12:42:35.176563+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: clusters patch embeddings into coherent semantic units... similarity threshold θ... adaptive token budget

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: dynamic token allocation... semantic granularity control

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 12 internal anchors

[1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv:2404.14219, 2024.
[2] Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. In CVPR, 2025.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv:2309.16609, 2023.
[4] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726, 2024.
[5] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.
[6] Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal LLM. In CVPR, 2024.
[7] Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, and Honggang Chen. Variation-aware vision token dropping for faster large vision-language models. arXiv:2509.01552, 2025.
[8] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv:2306.15195, 2023.
[9] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.
[10] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,
[11] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.
[12] Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, and Sungwoong Kim. Slot-MLLM: Object-centric visual tokenization for multimodal LLM. arXiv:2505.17726, 2025.
[13] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[14] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM v2: Faster and stronger baseline for vision language model. arXiv:2402.03766, 2024.
[15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[17] Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. In ACL, 2024.
[18] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023.
[19] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[20] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Conference on Language Modeling, 2024.
[21] Lianyu Hu, Fanhua Shang, Liang Wan, and Wei Feng. iLLaVA: An image is worth fewer than 1/3 input tokens in large multimodal models. arXiv:2412.06263, 2024.
[22] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv:2404.06395, 2024.
[23] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[24] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024.
[25] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In ICML, 2023.
[26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML,
[27] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal LLM. International Journal of Computer Vision, pages 1–19, 2025.
[28] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.
[29] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In CVPR, 2024.
[30] Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In AAAI, 2025.
[31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[32] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024.
[33] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS,
[34] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193, 2023.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[36] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[37] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.
[38] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR,
[39] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv:2403.08295, 2024.
[40] Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, and Rongrong Ji. FlashSloth: Lightning multimodal large language models via embedded visual compression. In CVPR, 2025.
[41] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv:2409.12191,
[42] Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal llm. arXiv:2406.05127, 2024.
[43] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. Conical visual concentration for efficient large vision-language models. In CVPR, 2025.
[44] Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, and Yan Yan. freePruner: A training-free approach for large multimodal model acceleration. arXiv:2411.15446, 2024.
[45] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. In CVPR, 2025.
[46] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level mllm on your phone. arXiv:2408.01800, 2024.
[47] Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. ATP-LLaVA: Adaptive token pruning for large vision language models. In CVPR, 2025.
[48] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: evaluating large multimodal models for integrated capabilities. In ICML, 2024.
[49] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
[50] Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms. In ICCV, 2025.
[51] Wenhong Zhu, Hongkun Hao, Zhiwei He, Yiming Ai, and Rui Wang. Improving open-ended text generation via adaptive decoding. In ICML, 2024.
[52] Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. LLaVA-Phi: Efficient multi-modal assistant with small language model. In International Workshop on Efficient Multimedia Computing under Limited, 2024.

[1] [1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadal- lah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv:2404.14219, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

DivPrune: Diversity-based visual token pruning for large multimodal models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. InCVPR, 2025. 7, 8

work page 2025

[3] [3]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv:2309.16609, 2023. 1, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726, 2024. 8

[5] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023. 5

[6] Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal LLM. In CVPR, 2024. 1, 5, 6, 8

[7] Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, and Honggang Chen. Variation-aware vision token dropping for faster large vision-language models. arXiv:2509.01552, 2025. 7, 8

[8] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv:2306.15195, 2023. 1, 8

[9] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024. 5, 7, 8

[10] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,

[11] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024. 8

[12] Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, and Sungwoong Kim. Slot-MLLM: Object-centric visual tokenization for multimodal LLM. arXiv:2505.17726, 2025. 8

[13] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. 6

[14] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv:2402.03766, 2024. 1, 6, 8

[15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023. 1, 8

[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 1

[17] Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. In ACL, 2024. 8

[18] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023. 5

[19] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 5

[20] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Conference on Language Modeling, 2024. 8

[21] Lianyu Hu, Fanhua Shang, Liang Wan, and Wei Feng. iLLaVA: An image is worth fewer than 1/3 input tokens in large multimodal models. arXiv:2412.06263, 2024. 7, 8

[22] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv:2404.06395, 2024. 8

[23] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 5

[24] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024. 8

[25] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In ICML, 2023. 8

[26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML,

[27] Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. TokenPacker: Efficient visual projector for multimodal LLM. International Journal of Computer Vision, pages 1–19, 2025. 1, 5, 6, 8

[28] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023. 5

[29] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In CVPR, 2024. 8

[30] Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In AAAI, 2025. 7, 8

[31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 1, 5, 8

[32] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024. 3, 5

[33] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS,

[34] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193, 2023. 6

[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 1, 6

[36] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016. 2

[37] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025. 5, 7, 8

[38] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR,

[39] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv:2403.08295, 2024. 8

[40] Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, and Rongrong Ji. FlashSloth: Lightning multimodal large language models via embedded visual compression. In CVPR, 2025. 1

[41] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv:2409.12191,

[42] Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal LLM. arXiv:2406.05127, 2024. 8

[43] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. Conical visual concentration for efficient large vision-language models. In CVPR, 2025. 7, 8

[44] Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, and Yan Yan. freePruner: A training-free approach for large multimodal model acceleration. arXiv:2411.15446, 2024. 8

[45] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. In CVPR, 2025. 5, 7, 8

[46] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv:2408.01800, 2024. 8

[47] Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. ATP-LLaVA: Adaptive token pruning for large vision language models. In CVPR, 2025. 5, 7, 8

[48] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. In ICML, 2024. 5

[49] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 1, 6
