
arxiv: 2605.17954 · v1 · pith:HDKS4VNRnew · submitted 2026-05-18 · 💻 cs.CV · cs.AI · cs.LG

A More Word-like Image Tokenization for MLLMs

Pith reviewed 2026-05-20 12:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords visual tokenization · multimodal large language models · image clustering · efficient inference · semantic tokens · token budget

The pith

Clustering image patches into semantic units yields fewer, more word-like visual tokens for multimodal models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern MLLMs convert images into long streams of continuous embeddings that differ from the discrete word tokens LLMs expect. DiVT clusters patch embeddings from the vision encoder into groups that each represent a distinct visual concept rather than a fixed grid cell. The method dynamically adjusts the total number of tokens according to image complexity. On multiple multimodal benchmarks it matches or exceeds standard projectors while using substantially fewer tokens. This shortens sequence length and thereby lowers memory use and latency without any change to the vision encoder or language model.

Core claim

DiVT replaces fixed-grid visual tokenization with a clustering step that groups patch embeddings into coherent semantic units. Each output token therefore corresponds to one visual concept instead of one spatial location. The clustering also adapts the token budget to the input image so that simpler scenes receive fewer tokens and complex scenes receive more, supplying an explicit accuracy-compute trade-off.
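
A minimal sketch of how a similarity threshold can make the token count follow image content. It assumes a greedy leader-clustering rule chosen purely for illustration; it is not the paper's algorithm. The function name `cluster_patches`, the 0.8 threshold, and the synthetic "images" are all hypothetical.

```python
import numpy as np

def cluster_patches(patches: np.ndarray, threshold: float = 0.8):
    """Greedy leader clustering on cosine similarity (illustrative, not DiVT itself).

    patches: (N, D) patch embeddings.
    Returns (tokens, labels): tokens is (K, D), one mean-pooled embedding per
    cluster; labels is (N,), the cluster index assigned to each patch.
    """
    unit = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    leaders = []                              # indices of patches that opened a cluster
    labels = np.empty(len(patches), dtype=int)
    for i, vec in enumerate(unit):
        if leaders:
            sims = unit[leaders] @ vec        # similarity to every existing leader
            best = int(np.argmax(sims))
            if sims[best] >= threshold:       # close enough: join that cluster
                labels[i] = best
                continue
        leaders.append(i)                     # otherwise open a new cluster
        labels[i] = len(leaders) - 1
    tokens = np.stack([patches[labels == k].mean(axis=0) for k in range(len(leaders))])
    return tokens, labels

rng = np.random.default_rng(0)

def synthetic_image(num_concepts, patches_per_concept=16, dim=64):
    """Hypothetical stand-in for encoder output: tight groups around random centers."""
    centers = rng.normal(size=(num_concepts, dim))
    noise = 0.05 * rng.normal(size=(num_concepts, patches_per_concept, dim))
    return (centers[:, None, :] + noise).reshape(-1, dim)

simple_tokens, _ = cluster_patches(synthetic_image(2))    # few concepts
complex_tokens, _ = cluster_patches(synthetic_image(8))   # many concepts
print(simple_tokens.shape[0], complex_tokens.shape[0])
```

In this toy setting the number of output tokens tracks the number of underlying concepts rather than the number of patches, and lowering `threshold` merges more patches per token. Whether real encoder features separate this cleanly is exactly what the paper's experiments would need to show.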

What carries the argument

Disentangled Visual Tokenization (DiVT), a post-encoder clustering operation that converts a dense grid of patch embeddings into a shorter sequence of semantically distinct tokens.

If this is right

  • DiVT reaches equal or higher accuracy on multimodal benchmarks with markedly fewer visual tokens.
  • Memory footprint and inference latency fall as the visual token count shrinks.
  • Token budget can be scaled directly with scene complexity to trade accuracy for speed.
  • Visual inputs become more compatible with the discrete-token regime the language model was originally trained on.
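
To put rough numbers on the memory and latency point, the sketch below counts KV-cache bytes and attention-score entries for two visual token budgets. The model shape (32 layers, 32 heads of width 128, fp16 values), the 64-token text prompt, and the 576 versus 144 budgets are hypothetical placeholders, not figures reported in the paper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_value=2):
    """Keys plus values cached for every layer, head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def attention_score_entries(seq_len, num_layers=32, num_heads=32):
    """Entries in the full (seq_len x seq_len) score matrices across layers and heads."""
    return num_layers * num_heads * seq_len * seq_len

text_tokens = 64                      # assumed prompt length
for visual_tokens in (576, 144):      # assumed full-grid vs. clustered budgets
    n = text_tokens + visual_tokens
    mib = kv_cache_bytes(n) / 2**20
    print(f"{visual_tokens} visual tokens: {mib:.0f} MiB KV cache, "
          f"{attention_score_entries(n):,} score entries")
```

Under these assumptions the KV cache scales linearly with total sequence length while prefill attention scales quadratically, so the saving depends on how much of the sequence is visual rather than text.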

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering idea could be applied to video frames treated as extended patch sequences to control token growth over time.
  • Lower token counts might allow higher-resolution inputs to be processed without quadratic growth in compute.
  • Semantic clusters might serve as an interpretable intermediate representation for debugging what the model attends to.

Load-bearing premise

Grouping nearby patch embeddings into coherent semantic clusters produces tokens that the fixed language model can treat like discrete word units.

What would settle it

A controlled test on a benchmark of highly detailed images in which DiVT requires the same or greater number of tokens as a standard projector to reach equivalent accuracy, or in which the clustered tokens show no measurable increase in compatibility with the language model’s attention patterns.

Figures

Figures reproduced from arXiv: 2605.17954 by Hyemin Jeong, Hyungwook Choi, Hyun Lee, Hyunsoo Cho, Joonseok Lee, Soo Kyung Kim, Yejin Kim.

Figure 1: Comparison with existing projectors. Patch features (bottom layer) are mapped to visual tokens (top layer). Each color represents the principal semantic of the patch.
… pre-trained LLM for its reasoning ability and a pre-trained vision encoder (e.g. CLIP [35], SigLIP [49]) to map the pixel-level signals to a semantic latent space. Since these two pre-trained models operate on different latent spaces, a visua…
Figure 2: Patch similarity across ViT layers. Patch-wise cosine similarity increases in deeper layers, indicating that repeated self-attention homogenizes patch embeddings within an image.
… this threshold can also be adjusted at inference time, allowing practitioners to trade-off representational detail against memory and latency without retraining, and to match computational budgets in deployment. We evaluate our …
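
The quantity in Figure 2 can be computed along these lines. This is a sketch under assumptions: `mean_pairwise_cosine` is a hypothetical helper, and the "layers" here are simulated by repeatedly mixing each patch with the image mean, a crude stand-in for the averaging effect of self-attention rather than a real ViT.

```python
import numpy as np

def mean_pairwise_cosine(patches: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of patch embeddings."""
    unit = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(patches)
    off_diagonal_sum = sim.sum() - np.trace(sim)   # drop self-similarities
    return float(off_diagonal_sum / (n * (n - 1)))

rng = np.random.default_rng(0)
layer = rng.normal(size=(196, 64))                 # assumed 14x14 patch grid, 64-d features
history = []
for depth in range(4):
    history.append(mean_pairwise_cosine(layer))
    # Simulated "layer": pull every patch halfway toward the image mean.
    layer = 0.5 * layer + 0.5 * layer.mean(axis=0, keepdims=True)

for depth, value in enumerate(history):
    print(depth, round(value, 3))
```

With a real encoder, `layer` would instead be the hidden states taken at each transformer block for a single image.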
Figure 3: Overview of DiVT. The process consists of three main stages: (1) Initial patch clustering, which elects representative patch centroids based on feature diversity (Sec. 3.1); (2) Cluster refinement for semantically more coherent groups (Sec. 3.2); (3) Visual token formulation to aggregate information within each cluster to semantically disentangled visual tokens (Sec. 3.3).
… of 500 images from MMBench [32] …
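
The three stages named in the Figure 3 caption can be sketched as below. Every concrete choice is an assumption made for illustration: farthest-point selection stands in for diversity-based centroid election, a few spherical k-means style updates stand in for refinement, and mean pooling stands in for token formulation. The paper's Sec. 3.1 to 3.3 may differ in each step, and the cluster count `k` is fixed here rather than adaptive.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_diverse_centroids(unit, k):
    """Stage 1 stand-in: farthest-point selection on cosine similarity."""
    chosen = [0]
    max_sim = unit @ unit[0]              # similarity of each patch to its closest chosen centroid
    for _ in range(k - 1):
        nxt = int(np.argmin(max_sim))     # patch least similar to everything chosen so far
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
    return unit[chosen]

def refine(unit, centroids, iterations=3):
    """Stage 2 stand-in: alternate assignment and centroid update."""
    for _ in range(iterations):
        labels = np.argmax(unit @ centroids.T, axis=1)
        for c in range(len(centroids)):
            members = unit[labels == c]
            if len(members):
                centroids[c] = normalize(members.mean(axis=0))
    return np.argmax(unit @ centroids.T, axis=1)

def form_tokens(patches, labels, k):
    """Stage 3 stand-in: mean-pool the original embeddings of each non-empty cluster."""
    return np.stack([patches[labels == c].mean(axis=0) for c in range(k) if np.any(labels == c)])

def tokenize(patches, k):
    unit = normalize(patches)
    centroids = select_diverse_centroids(unit, k).copy()
    labels = refine(unit, centroids)
    return form_tokens(patches, labels, k), labels

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 32))        # three synthetic "concepts"
patches = np.repeat(centers, 20, axis=0) + 0.05 * rng.normal(size=(60, 32))
tokens, labels = tokenize(patches, k=3)
print(tokens.shape, np.bincount(labels))
```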
Figure 4: Illustration of dynamic token clustering. An image with relatively simpler content (top) uses less number of clusters than one with a more complex scene (bottom). See Appendix E for more examples.
… supporting adaptive token lengths for each image. Unlike the linear projectors that ignore patch entanglement, our approach restructures visual information into semantic units, better compatible with the discret…
Figure 5: Qualitative demonstration. Attention maps highlight the regions in the image that the model attends to for specific object tokens. Our method produces attention clusters tightly focused on the object token, yielding more interpretable patterns, while the MLP projector exhibits more scattered attention over irrelevant regions. See Appendix D for more examples.
… Backbone # Tokens MMB VQAv2 GQA MME MM-Vet VQAT…
Original abstract

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Disentangled Visual Tokenization (DiVT) for MLLMs. It clusters patch embeddings from a fixed vision encoder into coherent semantic units so that each visual token corresponds to a distinct concept rather than a grid cell, while also adapting the token budget to image complexity. The central claim is that this produces visual inputs more compatible with a frozen LLM's discrete token regime, allowing DiVT to match or exceed baseline performance on multimodal benchmarks with substantially fewer tokens and without any changes to the vision encoder or language model.

Significance. If the central claim holds, the work would be significant for efficient multimodal modeling: it offers an explicit accuracy-compute trade-off and a concrete mechanism for reducing memory and latency while preserving or improving downstream performance. The public release of code is a clear strength that enables direct verification of the clustering procedure and adaptive budget.

major comments (2)
  1. [Method and Experiments] The core assumption that clustering yields tokens the frozen LLM processes more like its original word tokens (rather than simply benefiting from shorter sequence length) is load-bearing for the entire contribution. No embedding-space statistics, attention-map comparisons, or ablation isolating semantic coherence versus length reduction are referenced in the provided description of the method or experiments; without such evidence the performance gains could be explained by the reduced token count alone.
  2. [Abstract and §4] The abstract and method description state performance gains with fewer tokens but supply no quantitative details on the clustering algorithm, error bars, dataset splits, or ablation studies. This absence makes it impossible to assess whether post-hoc choices affect the reported robustness under limited token budgets.
minor comments (2)
  1. [Figures and Tables] Figure captions and tables should explicitly report the average and maximum token reduction percentages alongside the benchmark scores for direct comparison with baselines.
  2. [Method] Notation for the adaptive token budget (e.g., how complexity is estimated and the exact threshold function) should be formalized with an equation to avoid ambiguity.
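
One way to build the length-matched control that major comment 1 asks for is plain grid average pooling, which reduces token count without using any semantic grouping. The sketch below is an assumption about how such a baseline might look, not something the paper or its rebuttal specifies; `grid_pool` and the 24 x 24 grid are hypothetical.

```python
import numpy as np

def grid_pool(patches: np.ndarray, grid: int, window: int) -> np.ndarray:
    """Average non-overlapping window x window blocks of a row-major grid x grid patch map."""
    n, dim = patches.shape
    assert n == grid * grid and grid % window == 0
    side = grid // window
    # Split rows and columns into (block index, offset inside block), then average the offsets.
    blocks = patches.reshape(side, window, side, window, dim)
    return blocks.mean(axis=(1, 3)).reshape(side * side, dim)

rng = np.random.default_rng(0)
patches = rng.normal(size=(24 * 24, 64))        # assumed 576 patch embeddings
pooled = grid_pool(patches, grid=24, window=2)  # 144 tokens, position-based only
print(pooled.shape)
```

Feeding the language model `pooled` and a clustered sequence of the same length would separate the effect of shorter sequences from the effect of semantic grouping.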

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below and have revised the manuscript to incorporate additional evidence and details where appropriate.

Point-by-point responses
  1. Referee: [Method and Experiments] The core assumption that clustering yields tokens the frozen LLM processes more like its original word tokens (rather than simply benefiting from shorter sequence length) is load-bearing for the entire contribution. No embedding-space statistics, attention-map comparisons, or ablation isolating semantic coherence versus length reduction are referenced in the provided description of the method or experiments; without such evidence the performance gains could be explained by the reduced token count alone.

    Authors: We agree that direct evidence isolating semantic coherence from length reduction would strengthen the central claim. Our primary results already indicate that the benefit is not solely from shorter sequences, as DiVT with fewer tokens matches or exceeds the performance of standard full-token baselines (where simply reducing token count in fixed-grid methods typically degrades results). To address this explicitly, the revised manuscript now includes embedding-space similarity statistics between DiVT tokens and LLM word embeddings, attention-map comparisons demonstrating more concept-focused patterns, and an ablation contrasting semantic clustering against length-matched but non-semantic token reduction.
    Revision: yes

  2. Referee: [Abstract and §4] The abstract and method description state performance gains with fewer tokens but supply no quantitative details on the clustering algorithm, error bars, dataset splits, or ablation studies. This absence makes it impossible to assess whether post-hoc choices affect the reported robustness under limited token budgets.

    Authors: We acknowledge that the abstract and initial method description lacked sufficient quantitative specifics. The revised manuscript expands these sections to report the clustering algorithm details (including the adaptive mechanism for determining cluster count based on image complexity and the specific hyperparameters used), includes error bars from multiple random seeds, clarifies the exact dataset splits for all benchmarks, and adds ablation studies examining sensitivity to clustering parameters and token budget choices to confirm robustness.
    Revision: yes
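
The embedding-space statistic mentioned in the first response could take a form like the following: for each visual token, the cosine similarity to its nearest word embedding. This is a guess at one reasonable statistic, not the rebuttal's actual measurement; `nearest_word_similarity` and the random matrices standing in for projected visual tokens and the LLM vocabulary table are hypothetical.

```python
import numpy as np

def nearest_word_similarity(visual_tokens: np.ndarray, word_embeddings: np.ndarray):
    """Return, per visual token, the best cosine similarity to any word embedding and its index."""
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    w = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    sims = v @ w.T                                  # (num_visual, vocab_size)
    return sims.max(axis=1), sims.argmax(axis=1)

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))                 # placeholder vocabulary embeddings
visual = rng.normal(size=(144, 64))                 # placeholder visual tokens
best_sim, best_idx = nearest_word_similarity(visual, vocab)
print(float(best_sim.mean()))
```

Comparing the distribution of `best_sim` between a standard projector and a clustering tokenizer would be one way to test the "more word-like" claim directly.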

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no self-referential reductions or fitted predictions presented as derivations.

full rationale

The paper proposes DiVT as a clustering-based tokenization technique that adapts token count to image complexity, then reports empirical results on benchmarks. No equations, first-principles derivations, or mathematical claims are present in the provided text. The central claims rest on experimental comparisons rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed fitted parameters. The adaptive budget is described as an explicit design choice for accuracy-compute trade-off, not as a prediction derived from the same data used for evaluation. This is a standard self-contained empirical contribution without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit parameters, axioms, or invented entities; clustering and adaptation mechanisms are described at high level only.

pith-pipeline@v0.9.0 · 5758 in / 1090 out tokens · 31238 ms · 2026-05-20T12:42:35.176563+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 12 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv:2404.14219, 2024.

  2. [2]

    DivPrune: Diversity-based visual token pruning for large multimodal models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. In CVPR, 2025.

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv:2309.16609, 2023.

  4. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv:2407.07726, 2024.

  5. [5]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.

  6. [6]

    Honeybee: Locality-enhanced projector for multimodal LLM

    Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal LLM. In CVPR, 2024.

  7. [7]

    Variation-aware vision token dropping for faster large vision-language models

    Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, and Honggang Chen. Variation-aware vision token dropping for faster large vision-language models. arXiv:2509.01552, 2025.

  8. [8]

    Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv:2306.15195, 2023.

  9. [9]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In ECCV, 2024.

  10. [10]

    How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,

  11. [11]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.

  12. [12]

    Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

    Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, and Sungwoong Kim. Slot-MLLM: Object-centric visual tokenization for multimodal LLM. arXiv:2505.17726, 2025.

  13. [13]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.

  14. [14]

    MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv:2402.03766, 2024.

  15. [15]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.

  16. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.

  17. [17]

    LayerSkip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. In ACL, 2024.

  18. [18]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023.

  19. [19]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.

  20. [20]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Conference on Language Modeling, 2024.

  21. [21]

    iLLaVA: An image is worth fewer than 1/3 input tokens in large multimodal models

    Lianyu Hu, Fanhua Shang, Liang Wan, and Wei Feng. iLLaVA: An image is worth fewer than 1/3 input tokens in large multimodal models. arXiv:2412.06263, 2024.

  22. [22]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv:2404.06395, 2024.

  23. [23]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.

  24. [24]

    Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024.

  25. [25]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In ICML, 2023.

  26. [26]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML,

  27. [27]

    TokenPacker: Efficient visual projector for multimodal LLM

    Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. TokenPacker: Efficient visual projector for multimodal LLM. International Journal of Computer Vision, pages 1–19, 2025.

  28. [28]

    Evaluating object hallucination in large vision-language models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In EMNLP, 2023.

  29. [29]

    VILA: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. In CVPR, 2024.

  30. [30]

    Boosting multimodal large language models with visual tokens withdrawal for rapid inference

    Zhihang Lin, Mingbao Lin, Luxi Lin, and Rongrong Ji. Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In AAAI, 2025.

  31. [31]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.

  32. [32]

    MMBench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In ECCV, 2024.

  33. [33]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS,

  34. [34]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193, 2023.

  35. [35]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  36. [36]

    Neural machine translation of rare words with subword units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

  37. [37]

    LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive token reduction for efficient large multimodal models. In ICCV, 2025.

  38. [38]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR,

  39. [39]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv:2403.08295, 2024.

  40. [40]

    FlashSloth: Lightning multimodal large language models via embedded visual compression

    Bo Tong, Bokai Lai, Yiyi Zhou, Gen Luo, Yunhang Shen, Ke Li, Xiaoshuai Sun, and Rongrong Ji. FlashSloth: Lightning multimodal large language models via embedded visual compression. In CVPR, 2025.

  41. [41]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv:2409.12191,

  42. [42]

    Towards semantic equivalence of tokenization in multimodal LLM

    Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, and Shuicheng Yan. Towards semantic equivalence of tokenization in multimodal LLM. arXiv:2406.05127, 2024.

  43. [43]

    Conical visual concentration for efficient large vision-language models

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. Conical visual concentration for efficient large vision-language models. In CVPR, 2025.

  44. [44]

    freePruner: A training-free approach for large multimodal model acceleration

    Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, and Yan Yan. freePruner: A training-free approach for large multimodal model acceleration. arXiv:2411.15446, 2024.

  45. [45]

    VisionZip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is better but not necessary in vision language models. In CVPR, 2025.

  46. [46]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv:2408.01800, 2024.

  47. [47]

    ATP-LLaVA: Adaptive token pruning for large vision language models

    Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. ATP-LLaVA: Adaptive token pruning for large vision language models. In CVPR, 2025.

  48. [48]

    MM-Vet: evaluating large multimodal models for integrated capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: evaluating large multimodal models for integrated capabilities. In ICML, 2024.

  49. [49]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.

  50. [50]

    Beyond text-visual attention: Exploiting visual cues for effective token pruning in VLMs

    Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in VLMs. In ICCV, 2025.

  51. [51]

    Improving open-ended text generation via adaptive decoding

    Wenhong Zhu, Hongkun Hao, Zhiwei He, Yiming Ai, and Rui Wang. Improving open-ended text generation via adaptive decoding. In ICML, 2024.

  52. [52]

    LLaVA-Phi: Efficient multi-modal assistant with small language model

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. LLaVA-Phi: Efficient multi-modal assistant with small language model. In International Workshop on Efficient Multimedia Computing under Limited, 2024.

A More Word-like Image Tokenization for MLLMs: Supplementary Material

A. Implementation Details

Hyperparameters. Our implementation closely fo...