Text or pixels? it takes half: On the token efficiency of visual text inputs in multimodal llms.arXiv preprint arXiv:2510.18279

Yanhong Li, Zixuan Lan, Jiawei Zhou · 2025 · arXiv 2510.18279

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Visual Text Compression as Measure Transport

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using 10% fewer tokens.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

cs.CV · 2026-02-04 · conditional · novelty 7.0

VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.

Token-Efficient Multimodal Reasoning via Image Prompt Packaging

cs.CV · 2026-04-02 · unverdicted · novelty 6.0

IPPg embeds text into images to reduce multimodal model inference costs by 35.8-91% with competitive accuracy on many VQA and code benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

Visual Text Compression as Measure Transport cs.CV · 2026-05-06 · unverdicted · none · ref 23
Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using 10% fewer tokens.
VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? cs.CV · 2026-02-04 · conditional · none · ref 8
VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion cs.CV · 2026-05-28 · unverdicted · none · ref 19
LoMo is a lightweight data curation technique that locally substitutes text with images in prompts to enforce cross-modal invariance, yielding 2.67-2.82 point gains over standard SFT on two VLMs across 13 benchmarks.
Token-Efficient Multimodal Reasoning via Image Prompt Packaging cs.CV · 2026-04-02 · unverdicted · none · ref 10
IPPg embeds text into images to reduce multimodal model inference costs by 35.8-91% with competitive accuracy on many VQA and code benchmarks.

Text or pixels? it takes half: On the token efficiency of visual text inputs in multimodal llms.arXiv preprint arXiv:2510.18279

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer