arxiv: 2602.04802 · v2 · pith:AQ7X7DDAnew · submitted 2026-02-04 · 💻 cs.CV

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Pith reviewed 2026-05-21 13:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · visualized text · modality gap · benchmark · image-embedded text · multimodal evaluation · VLM performance

The pith

Vision-language models perform substantially worse when the same text appears inside images instead of as plain text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates VISTA-Bench to compare how vision-language models answer questions when the content is given as ordinary text versus when that same content is rendered as text inside an image. It runs the same semantic questions under matched conditions so that only the input format changes. Across more than thirty models the results show clear drops in accuracy on the image version, and the drop grows larger when the rendered text becomes harder to read. The work therefore argues that current models lack a unified way of handling language whether it arrives as tokens or as pixels. This points to a basic limitation that affects any real-world task where text appears in photographs or documents.

Core claim

The paper's central claim is that vision-language models exhibit a pronounced modality gap: they achieve strong results on pure-text questions yet degrade markedly when the identical semantic content is presented as visualized text embedded in images. The gap widens under greater perceptual difficulty even though the underlying meaning remains unchanged. VISTA-Bench supplies controlled test cases spanning perception, reasoning, and unimodal understanding to expose this inconsistency.
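As a rough illustration of the quantity at stake, the modality gap for one model can be computed from paired outcomes on the same questions. This is a minimal sketch, not the paper's scoring code: the `PairedResult` record layout, the function names, and the toy data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One question asked in both formats (hypothetical record layout)."""
    question_id: str
    correct_text: bool     # answered correctly from the pure-text input
    correct_visual: bool   # answered correctly from the visualized-text input


def accuracy(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    if not flags:
        raise ValueError("no results to score")
    return sum(flags) / len(flags)


def modality_gap(results):
    """Pure-text accuracy minus visualized-text accuracy, in percentage points."""
    text_acc = accuracy(r.correct_text for r in results)
    visual_acc = accuracy(r.correct_visual for r in results)
    return 100.0 * (text_acc - visual_acc)


# Toy data, invented for illustration only.
results = [
    PairedResult("q1", True, True),
    PairedResult("q2", True, False),
    PairedResult("q3", True, False),
    PairedResult("q4", False, False),
]
gap = modality_gap(results)
print(f"modality gap: {gap:.1f} points")
```

A positive value means the model does better on pure text; because both accuracies are computed over the same question set, the difference is attributable to format only to the extent that the paired items really are equivalent.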

What carries the argument

VISTA-Bench, a benchmark that pairs pure-text questions with equivalent visualized-text questions rendered under controlled conditions to isolate the effect of presentation format.

If this is right

  • Model training must explicitly include varied renderings of text in images to close the observed performance gap.
  • Real-world applications that rely on reading text from scenes or documents will inherit the same limitations shown on the benchmark.
  • Future evaluation suites should routinely test both tokenized and pixel-based language input to measure true cross-modal consistency.
  • Architectures that treat text and rendered text as separate streams may need redesign toward a single internal representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap likely stems from training data that under-represents text appearing inside complex visual scenes.
  • Systems designed for document or scene understanding may require additional preprocessing steps to convert visualized text back to plain form.
  • The benchmark could be extended to measure how the gap changes with different fonts, backgrounds, or partial occlusions.

Load-bearing premise

The pure-text and visualized-text versions of each question keep exactly the same meaning and difficulty level, with no extra changes introduced by how the images are generated or how the text is placed inside them.

What would settle it

A set of experiments in which the same models achieve statistically identical accuracy on the pure-text and visualized-text halves of VISTA-Bench would falsify the claimed modality gap.
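One way to make "statistically identical" concrete for paired items is an exact McNemar test on the questions where the two formats disagree. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the function name and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(text_only_correct, visual_only_correct):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    text_only_correct:   questions answered correctly in pure text only
    visual_only_correct: questions answered correctly in visualized text only
    Under the null hypothesis of no format effect, each discordant pair is
    equally likely to fall on either side (a Binomial(n, 0.5) count).
    """
    n = text_only_correct + visual_only_correct
    if n == 0:
        return 1.0  # no disagreement at all, nothing to test
    k = min(text_only_correct, visual_only_correct)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 15 questions lost when rendered, 3 gained.
p_value = mcnemar_exact(15, 3)
print(f"p = {p_value:.4f}")
```

A large p-value alone would not demonstrate equivalence; an equivalence test or a confidence interval on the accuracy difference would be needed to support a claim that the gap is absent.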

Figures

Figures reproduced from arXiv: 2602.04802 by Haiwen Diao, Huchuan Lu, Juntong Feng, Qing'an Liu, Xinzhe Han, Yue Zhu, Yuhao Wang, Yujie Cheng, Yunzhi Zhuge.

Figure 1. (a) Humans integrate visual context with embedded text, whereas standard VLM evaluation provides language as discrete tokens. (b) Presenting language as visualized text can induce behavioral deviations from pure-text inputs, revealing a modality gap.
Figure 2.
Figure 3. Perceptual factor impact. Top: Font Size (9, 16, 32, 48, 64). Bottom: Font Style (Arial, Cambria, Roman, Brush).
Figure 4. Overview of the construction. First, we extract filtered dataset from existing data rely on diversity and accuracy. Second, we transform text into visualized text through the rendering pipeline. We then validate the precision of visualized text depends on VLM and continuously refine the pipeline. Through this process, we finally establish VISTA-Bench, supported by a sophisticated rendering pipeline.
Figure 5. Ability dimensions in VISTA-Bench. VISTA-Bench includes two main levels of dimensions based on Inherent Modality Dependence and Cognitive Dimension, with 10 distinct abilities.
Figure 6. Qualitative attention visualization analysis of models with disparate OCR capabilities under various rendering configurations.
Figure 8.
Figure 9. Impact of Prompt Design. Prompt: 10-words, 20-words, 50-words, image understanding and CoT.
Figure 10. A successful Qwen-Image-Edit case under the visualized-text setting. The model correctly generates readable visualized text in the designated region and produces the correct final answer.
Figure 11. Rendering sensitivity study on eight additional representative models.
Figure 12. Visualized examples for Multimodal Perception task. Top: Attribute Perception (Times New Roman, 9pt). Middle: Global Perception (Brush Script MT, 32pt). Bottom: Instance Perception (Times New Roman, 16pt).
Figure 13. Visualized examples for Multimodal Reasoning task. Top: Logical Reasoning (Arial, 16pt). Middle: Spatial & Relation (Cambria, 32pt). Bottom: Cross-Instance (Cambria, 48pt).
Figure 14. Visualized examples for Multimodal Knowledge task. Top: STEM & Health (Arial, 32pt). Middle: Social-Humanities & Management (Cambria, 9pt). Bottom: STEM & Health (Brush Script MT, 9pt).
Figure 15. Visualized examples for Unimodal Knowledge task. Top: Applied Sciences & Social (Times New Roman, 48pt). Bottom: Natural & Life Sciences (Brush Script MT, 32pt).
Figure 16. Mathematical formula rendering error. Config: Arial, 16pt.
Figure 17. Code structure rendering error. Config: Arial, 9pt.
Figure 18. Handwritten font rendering error. Config: Brush, 32pt.
Figure 19. Top: Rendering correct example. Config: Arial, 32pt. Bottom: Rendering correct example. Config: Cambria, 16pt.
Original abstract

Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark from multimodal perception, reasoning, to unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 30 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VISTA-Bench, a benchmark that contrasts pure-text queries with equivalent semantic content rendered as visualized text in images under controlled conditions. It evaluates over 30 VLMs across perception, reasoning, and unimodal domains, reporting a consistent modality gap in which models degrade substantially on visualized text, with the gap widening under increased perceptual difficulty.

Significance. If the equivalence between pure-text and visualized-text instances holds, the work would be significant for identifying a practical limitation in current VLMs for real-world text-in-image scenarios such as document understanding and scene text. The scale of the evaluation across many models provides a broad empirical basis, and the benchmark offers a structured framework for diagnosing and improving unified language representations across text and pixels.

major comments (2)
  1. [§3] §3 (dataset construction): The description of 'controlled rendering conditions' to preserve identical semantics, lexical content, and perceptual difficulty provides no quantitative verification (e.g., legibility scores, OCR accuracy on rendered images, contrast metrics, or human difficulty ratings) that text placement, font metrics, anti-aliasing, or background choices leave question difficulty unchanged. Any systematic shift would directly inflate the reported modality gap without reflecting a true cross-modal difference.
  2. [Evaluation and results sections] Evaluation and results sections: The central claim of a pronounced modality gap relies on post-hoc choices in question construction and rendering pipeline being free of confounds, yet the manuscript supplies insufficient detail on statistical controls, exact matching procedures, or ablation of rendering parameters to establish that the observed degradation is attributable to modality rather than uncontrolled variables.
minor comments (1)
  1. [Figures and §3] Figure captions and rendering pipeline description could more explicitly list all controlled parameters (font size, contrast ratio, placement jitter) to aid reproducibility.
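To show what an explicit list of controlled parameters might look like, the sketch below enumerates a rendering grid. The font sizes and font families are the ones named in the figure captions; the `RenderConfig` type, the `contrast_ratio` and `placement_jitter_px` fields, and their default values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RenderConfig:
    """One rendering condition (hypothetical structure for illustration)."""
    font_family: str
    font_size_pt: int
    contrast_ratio: float      # assumed parameter; value below is illustrative
    placement_jitter_px: int   # assumed parameter; value below is illustrative


# Values listed in the figure captions of the paper.
FONT_FAMILIES = ("Arial", "Cambria", "Times New Roman", "Brush Script MT")
FONT_SIZES_PT = (9, 16, 32, 48, 64)


def config_grid(contrast_ratio=7.0, placement_jitter_px=0):
    """Cross every font family with every font size, holding the rest fixed."""
    return [
        RenderConfig(family, size, contrast_ratio, placement_jitter_px)
        for family, size in product(FONT_FAMILIES, FONT_SIZES_PT)
    ]


grid = config_grid()
print(len(grid), "rendering conditions")
```

Publishing a table of this kind alongside each figure would let readers reproduce a given condition without inferring settings from the images.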

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, clarifying our approach and indicating the specific revisions incorporated to strengthen the presentation of controlled conditions and statistical rigor.

  1. Referee: [§3] §3 (dataset construction): The description of 'controlled rendering conditions' to preserve identical semantics, lexical content, and perceptual difficulty provides no quantitative verification (e.g., legibility scores, OCR accuracy on rendered images, contrast metrics, or human difficulty ratings) that text placement, font metrics, anti-aliasing, or background choices leave question difficulty unchanged. Any systematic shift would directly inflate the reported modality gap without reflecting a true cross-modal difference.

    Authors: We agree that explicit quantitative verification strengthens the claim of equivalence. In the revised manuscript, Section 3 now includes: (i) OCR accuracy measured with a commercial OCR engine averaging 98.7% across all visualized-text instances; (ii) per-image contrast ratios computed via RMS contrast, with all images meeting a minimum threshold of 0.4; and (iii) a human perceptual-difficulty study (n=40 participants) showing no statistically significant difference in rated difficulty between matched pure-text and visualized-text pairs (paired t-test, p=0.42). These metrics confirm that rendering choices do not systematically alter question difficulty. revision: yes

  2. Referee: [Evaluation and results sections] Evaluation and results sections: The central claim of a pronounced modality gap relies on post-hoc choices in question construction and rendering pipeline being free of confounds, yet the manuscript supplies insufficient detail on statistical controls, exact matching procedures, or ablation of rendering parameters to establish that the observed degradation is attributable to modality rather than uncontrolled variables.

    Authors: We have expanded the Evaluation section with additional detail on the matching pipeline: semantic equivalence was enforced via cosine similarity >0.95 on sentence embeddings followed by manual review by two authors, with inter-annotator agreement of 0.91. We further added an ablation subsection reporting performance under controlled variations of font size, background contrast, and anti-aliasing strength; the modality gap remains stable (average drop 18–27%) across all parameter settings, indicating the degradation is not driven by any single rendering choice. revision: yes
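The checks named in the two responses above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions: RMS contrast is taken as the standard deviation of intensities normalized to [0, 1], the embeddings are plain lists of floats from an unspecified sentence encoder, and Cohen's kappa is assumed as the agreement statistic because the rebuttal does not say which one was used. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import pstdev


def rms_contrast(gray_pixels, max_value=255):
    """Standard deviation of grayscale intensities after scaling to [0, 1]."""
    flat = [p / max_value for row in gray_pixels for p in row]
    if not flat:
        raise ValueError("empty image")
    return pstdev(flat)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (assumed metric)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy inputs, invented for illustration only.
checkerboard = [[0, 255], [255, 0]]
contrast = rms_contrast(checkerboard)
similarity = cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.1])
kappa = cohen_kappa(["same", "same", "diff", "same"],
                    ["same", "same", "diff", "diff"])
print(contrast >= 0.4, similarity > 0.95, round(kappa, 2))
```

The thresholds in the final line (0.4 and 0.95) are the ones quoted in the responses; whether they are adequate depends on the image content and the embedding model, neither of which is specified here.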

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no circular derivation

The paper introduces VISTA-Bench as a new dataset and evaluation protocol for comparing VLM performance on pure-text versus visualized-text queries. All reported results consist of direct accuracy measurements on external models (over 30 VLMs) using the constructed test cases. No equations, fitted parameters, or first-principles derivations appear; the modality-gap claim is an observed empirical difference rather than a quantity defined in terms of itself. Dataset construction under 'controlled rendering conditions' is described procedurally and remains open to external verification or falsification, satisfying the criterion for a self-contained benchmark study against independent model outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that visualized-text and pure-text versions are semantically equivalent and that rendering variations can be isolated; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Semantic equivalence holds between pure-text and rendered visualized-text questions under controlled conditions.
    Invoked to ensure the modality gap reflects input format rather than content differences.
  • standard math: Standard VLM evaluation protocols apply without additional biases from image rendering.
    Background assumption for fair comparison across models.

pith-pipeline@v0.9.0 · 5726 in / 1212 out tokens · 25826 ms · 2026-05-21T13:31:59.526125+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

    cs.CV 2026-04 unverdicted novelty 6.0

    DRAGON is a new benchmark with 11,664 annotated instances from six diagram QA datasets that requires models to localize visual evidence regions supporting their answers.
