A More Word-like Image Tokenization for MLLMs
Pith reviewed 2026-05-20 12:42 UTC · model grok-4.3
The pith
Clustering image patches into semantic units yields fewer, more word-like visual tokens for multimodal models
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiVT replaces fixed-grid visual tokenization with a clustering step that groups patch embeddings into coherent semantic units. Each output token therefore corresponds to one visual concept instead of one spatial location. The clustering also adapts the token budget to the input image so that simpler scenes receive fewer tokens and complex scenes receive more, supplying an explicit accuracy-compute trade-off.
What carries the argument
Disentangled Visual Tokenization (DiVT), a post-encoder clustering operation that converts a dense grid of patch embeddings into a shorter sequence of semantically distinct tokens.
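To make the idea of a post-encoder clustering step concrete, here is a minimal sketch of threshold-based grouping. It is an illustration under stated assumptions, not the paper's algorithm: the greedy single-pass procedure, the names `cosine` and `cluster_patches`, the default `theta=0.9`, and the mean-pooling of each cluster are all hypothetical choices made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_patches(patches, theta=0.9):
    """Greedy single-pass clustering (illustrative; not the paper's method).

    Each patch joins the first cluster whose running mean has cosine
    similarity >= theta with it; otherwise it starts a new cluster.
    Returns one mean-pooled vector per cluster.
    """
    sums, counts = [], []
    for p in patches:
        for i, s in enumerate(sums):
            centroid = [v / counts[i] for v in s]
            if cosine(p, centroid) >= theta:
                sums[i] = [v + x for v, x in zip(s, p)]
                counts[i] += 1
                break
        else:
            # No existing cluster was similar enough: open a new one.
            sums.append(list(p))
            counts.append(1)
    return [[v / c for v in s] for s, c in zip(sums, counts)]

# Toy 2-D "patch embeddings": two near-duplicate pairs.
patches = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
tokens = cluster_patches(patches, theta=0.9)
print(len(patches), "->", len(tokens))
```

In this sketch the number of output tokens is not fixed in advance: inputs with many near-duplicate patches collapse to few tokens, while more varied inputs keep more, which is the adaptive behaviour the review describes.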
If this is right
- On multimodal benchmarks, accuracy matches or exceeds baselines with markedly fewer visual tokens.
- Memory footprint and inference latency drop in proportion to the reduction in token count.
- Token budget can be scaled directly with scene complexity to trade accuracy for speed.
- Visual inputs become more compatible with the discrete-token regime the language model was originally trained on.
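The memory implication can be checked with back-of-the-envelope arithmetic on the key-value cache, which grows linearly with sequence length. The sketch below assumes standard multi-head attention and fp16 storage; the function name `kv_cache_bytes`, the 32-layer, 4096-wide configuration, and the token counts 576 and 144 are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    Assumes standard multi-head attention (keys and values each store
    hidden_size values per token per layer) and fp16 storage by default.
    """
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value

# Hypothetical 7B-class configuration: 32 layers, hidden size 4096.
full = kv_cache_bytes(576, num_layers=32, hidden_size=4096)
reduced = kv_cache_bytes(144, num_layers=32, hidden_size=4096)
print(full / 2**20, "MiB vs", reduced / 2**20, "MiB")
```

Under these assumptions a fourfold cut in visual tokens gives a fourfold cut in the cache held for them. Models that use grouped-query attention store fewer key-value values per token, so absolute numbers would differ.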
Where Pith is reading between the lines
- The same clustering idea could be applied to video frames treated as extended patch sequences to control token growth over time.
- Lower token counts might allow higher-resolution inputs to be processed without quadratic growth in compute.
- Semantic clusters might serve as an interpretable intermediate representation for debugging what the model attends to.
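The resolution point above rests on simple arithmetic: a fixed grid turns a doubling of image side length into four times as many tokens and sixteen times as many token pairs in full self-attention. The sketch below works this through, assuming a square image and a patch size of 14; the helper names `grid_tokens` and `attention_pairs` are hypothetical, and only visual-to-visual pairs are counted.

```python
def grid_tokens(image_size, patch_size=14):
    """Number of patch tokens a ViT-style encoder emits for a square image."""
    side = image_size // patch_size
    return side * side

def attention_pairs(num_tokens):
    """Token-pair interactions in full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens

for size in (336, 672):
    n = grid_tokens(size)
    print(size, n, attention_pairs(n))
```

A tokenizer whose output length tracks scene content rather than pixel count would break this link between resolution and sequence length, which is the speculation in the list above.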
Load-bearing premise
Grouping nearby patch embeddings into coherent semantic clusters produces tokens that the fixed language model can treat like discrete word units.
What would settle it
A controlled test on a benchmark of highly detailed images in which DiVT needs as many tokens as a standard projector, or more, to reach equivalent accuracy, or in which the clustered tokens show no measurable increase in compatibility with the language model’s attention patterns.
Original abstract
Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.
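The abstract's description of grid tokens as "highly correlated" suggests a simple diagnostic: the average cosine similarity over all pairs of tokens in a sequence. The following is a minimal sketch of such a statistic, assuming plain Python lists as vectors; the function name `mean_pairwise_cosine` and the toy vectors are hypothetical, and the paper may measure redundancy differently.

```python
import math
from itertools import combinations

def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0
    pairs = list(combinations(tokens, 2))
    if not pairs:
        return 0.0
    return sum(cos(a, b) for a, b in pairs) / len(pairs)

# Toy vectors: a redundant set versus a more distinct set.
redundant = [[1.0, 0.0], [0.99, 0.05], [0.98, 0.1]]
distinct = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_pairwise_cosine(redundant) > mean_pairwise_cosine(distinct))
```

A lower value after tokenization would indicate less redundancy among tokens, though it would not by itself show that the tokens are semantically meaningful.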
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Disentangled Visual Tokenization (DiVT) for MLLMs. It clusters patch embeddings from a fixed vision encoder into coherent semantic units so that each visual token corresponds to a distinct concept rather than a grid cell, while also adapting the token budget to image complexity. The central claim is that this produces visual inputs more compatible with a frozen LLM's discrete token regime, allowing DiVT to match or exceed baseline performance on multimodal benchmarks with substantially fewer tokens and without any changes to the vision encoder or language model.
Significance. If the central claim holds, the work would be significant for efficient multimodal modeling: it offers an explicit accuracy-compute trade-off and a concrete mechanism for reducing memory and latency while preserving or improving downstream performance. The public release of code is a clear strength that enables direct verification of the clustering procedure and adaptive budget.
Major comments (2)
- [Method and Experiments] The core assumption that clustering yields tokens the frozen LLM processes more like its original word tokens (rather than simply benefiting from shorter sequence length) is load-bearing for the entire contribution. No embedding-space statistics, attention-map comparisons, or ablation isolating semantic coherence versus length reduction are referenced in the provided description of the method or experiments; without such evidence the performance gains could be explained by the reduced token count alone.
- [Abstract and §4] The abstract and method description state performance gains with fewer tokens but supply no quantitative details on the clustering algorithm, error bars, dataset splits, or ablation studies. This absence makes it impossible to assess whether post-hoc choices affect the reported robustness under limited token budgets.
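One way to build the control that the first major comment asks for is a reduction that matches the token count but ignores content. The sketch below averages contiguous raster-order chunks of the patch sequence; it is a hypothetical baseline written for illustration, and the name `chunk_mean_pool` and its interface are assumptions rather than anything taken from the paper or its code.

```python
def chunk_mean_pool(patches, num_tokens):
    """Length-matched, non-semantic reduction (hypothetical baseline).

    Splits the patch sequence into num_tokens contiguous chunks in raster
    order and averages each chunk, ignoring what the patches contain.
    """
    n = len(patches)
    if not 1 <= num_tokens <= n:
        raise ValueError("num_tokens must be between 1 and len(patches)")
    pooled = []
    for k in range(num_tokens):
        start = k * n // num_tokens
        end = (k + 1) * n // num_tokens
        chunk = patches[start:end]
        dim = len(chunk[0])
        pooled.append([sum(p[d] for p in chunk) / len(chunk) for d in range(dim)])
    return pooled

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(chunk_mean_pool(patches, 2))
```

Comparing a clustering tokenizer against such a baseline at the same token count would separate the effect of semantic grouping from the effect of a shorter sequence. In the toy input, position-based pooling blends the two distinct patch types into identical averages, which is the failure mode a content-aware grouping is meant to avoid.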
Minor comments (2)
- [Figures and Tables] Figure captions and tables should explicitly report the average and maximum token reduction percentages alongside the benchmark scores for direct comparison with baselines.
- [Method] Notation for the adaptive token budget (e.g., how complexity is estimated and the exact threshold function) should be formalized with an equation to avoid ambiguity.
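As an indication of the kind of formalization the second minor comment requests, the token budget could be written as a function of the image. The expression below is only a hypothetical form: the symbols $K(x)$, $K_{\max}$, and $\mathcal{C}_{\theta}(x)$ are assumptions introduced here, and the paper's actual complexity estimate and threshold rule may differ.

```latex
K(x) \;=\; \min\bigl(K_{\max},\; \lvert \mathcal{C}_{\theta}(x) \rvert\bigr),
\qquad
\mathcal{C}_{\theta}(x) \;=\; \text{clusters of the patch embeddings of } x
\text{ at similarity threshold } \theta
```

Stating the budget this way would make explicit which quantity is fixed by the user (here $\theta$ and $K_{\max}$) and which is determined by the image.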
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below and have revised the manuscript to incorporate additional evidence and details where appropriate.
- Referee: [Method and Experiments] The core assumption that clustering yields tokens the frozen LLM processes more like its original word tokens (rather than simply benefiting from shorter sequence length) is load-bearing for the entire contribution. No embedding-space statistics, attention-map comparisons, or ablation isolating semantic coherence versus length reduction are referenced in the provided description of the method or experiments; without such evidence the performance gains could be explained by the reduced token count alone.
  Authors: We agree that direct evidence isolating semantic coherence from length reduction would strengthen the central claim. Our primary results already indicate that the benefit is not solely from shorter sequences, as DiVT with fewer tokens matches or exceeds the performance of standard full-token baselines (where simply reducing token count in fixed-grid methods typically degrades results). To address this explicitly, the revised manuscript now includes embedding-space similarity statistics between DiVT tokens and LLM word embeddings, attention-map comparisons demonstrating more concept-focused patterns, and an ablation contrasting semantic clustering against length-matched but non-semantic token reduction.
  Revision: yes
- Referee: [Abstract and §4] The abstract and method description state performance gains with fewer tokens but supply no quantitative details on the clustering algorithm, error bars, dataset splits, or ablation studies. This absence makes it impossible to assess whether post-hoc choices affect the reported robustness under limited token budgets.
  Authors: We acknowledge that the abstract and initial method description lacked sufficient quantitative specifics. The revised manuscript expands these sections to report the clustering algorithm details (including the adaptive mechanism for determining cluster count based on image complexity and the specific hyperparameters used), includes error bars from multiple random seeds, clarifies the exact dataset splits for all benchmarks, and adds ablation studies examining sensitivity to clustering parameters and token budget choices to confirm robustness.
  Revision: yes
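Error bars over random seeds, as mentioned in the second response, are commonly reported as a mean with a sample standard deviation. A minimal sketch of that summary follows; the function name `summarize_runs` is hypothetical, and the scores are placeholder values for illustration only, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder values for illustration only; not results from the paper.
placeholder_scores = [60.0, 62.0, 64.0]
mean, std = summarize_runs(placeholder_scores)
print(f"{mean:.1f} ± {std:.1f}")
```

Reporting the number of seeds alongside the spread lets readers judge whether a gap between two methods exceeds run-to-run variation.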
Circularity Check
No circularity: empirical method proposal with no self-referential reductions or fitted predictions presented as derivations.
The paper proposes DiVT as a clustering-based tokenization technique that adapts token count to image complexity, then reports empirical results on benchmarks. No equations, first-principles derivations, or mathematical claims are present in the provided text. The central claims rest on experimental comparisons rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed fitted parameters. The adaptive budget is described as an explicit design choice for accuracy-compute trade-off, not as a prediction derived from the same data used for evaluation. This is a standard self-contained empirical contribution without load-bearing circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "clusters patch embeddings into coherent semantic units... similarity threshold θ... adaptive token budget"
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "dynamic token allocation... semantic granularity control"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.