arxiv: 2604.09687 · v2 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng, Kezhen Chen, Yi Zhang, Diji Yang

Pith reviewed 2026-05-10 18:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language models · digital agnosia · grid2matrix benchmark · visual detail preservation · multimodal reasoning · zero-shot evaluation · patch boundary effects

The pith

Vision-language models lose fine grid details when converting colors to numbers because a gap opens between what their visual encoders retain and what their language output can express.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Grid2Matrix, a benchmark that shows a model a color grid plus a simple color-to-number key and asks it to output the matching matrix. Models from two major families collapse to near-zero accuracy on grids as small as 4x4 or 5x5 in zero-shot use, rather than declining smoothly with added density. Separate probes of the vision encoders recover substantially more of the original grid layout than the full model produces, isolating the loss to the step that turns visual features into language. The authors label this recoverable-but-unexpressed gap Digital Agnosia and trace many errors to how grid cells cross the fixed patch boundaries used by the vision backbone. Scaling the models or retraining for better multimodal alignment leaves the failure largely intact.

Core claim

We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language.

What carries the argument

Grid2Matrix benchmark, which forces exhaustive visual-to-matrix readout on grids of controlled size and color count, together with the identified Digital Agnosia gap between preserved visual features and final language output.
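To make the task concrete, here is a minimal Python sketch of how a G2M-style sample and a per-cell accuracy score could be produced. The palette, the function names `make_sample` and `cell_accuracy`, and the choice to score a shape mismatch as zero are illustrative assumptions, not details taken from the paper.

```python
import random

# Hypothetical palette; the paper's actual color set is not listed on this page.
PALETTE = ["red", "green", "blue", "yellow", "white", "black"]


def make_sample(n, num_colors, seed=0):
    """Return (color_grid, mapping, target_matrix) for an n x n grid."""
    rng = random.Random(seed)
    colors = PALETTE[:num_colors]
    # Color-to-number key shown to the model alongside the rendered grid.
    mapping = {color: idx for idx, color in enumerate(colors)}
    color_grid = [[rng.choice(colors) for _ in range(n)] for _ in range(n)]
    # Ground-truth matrix the model is expected to output.
    target = [[mapping[c] for c in row] for row in color_grid]
    return color_grid, mapping, target


def cell_accuracy(pred, target):
    """Fraction of matching cells; a shape mismatch scores 0 (assumption)."""
    if len(pred) != len(target):
        return 0.0
    if any(len(p) != len(t) for p, t in zip(pred, target)):
        return 0.0
    total = sum(len(row) for row in target)
    correct = sum(
        p == t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return correct / total
```

Raising `n` or `num_colors` is the only knob in this sketch, which mirrors the benchmark's stated idea of increasing visual complexity without adding semantic content.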

If this is right

VLMs cannot be trusted for tasks that require reading every cell of a table, chart, form, or GUI.
Visual encoders must be probed separately from the language head to determine how much detail actually reaches the output stage.
Many errors align with the fixed patch boundaries of the vision backbone rather than with grid content itself.
Neither larger model size nor additional multimodal alignment training removes the failure mode on this class of inputs.
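The patch-boundary point in the list above can be illustrated with a small geometric sketch. It counts how many vision patches overlap more than one grid cell for a given grid size. The 768-pixel image, the 16-pixel patch, and the function name `mixed_patch_fraction` are hypothetical choices for illustration; the paper's actual resolutions and its boundary-interaction categories are not reproduced here.

```python
def mixed_patch_fraction(image_px, patch_px, grid_n):
    """Fraction of patches overlapping more than one grid cell.

    Assumes a square image, a patch grid anchored at the top-left corner,
    and grid cells that divide the image evenly (all hypothetical choices).
    """
    patches_per_side = image_px // patch_px

    def cells_touched(p):
        # Patch p covers pixels [p * patch_px, (p + 1) * patch_px) on one axis.
        start = p * patch_px
        end = (p + 1) * patch_px
        # Integer arithmetic avoids floating-point error at exact boundaries.
        first_cell = (start * grid_n) // image_px
        last_cell = (end * grid_n - 1) // image_px
        return last_cell - first_cell + 1

    mixed_axis = [cells_touched(p) > 1 for p in range(patches_per_side)]
    # A patch is mixed if it straddles a cell border on either axis.
    mixed = sum(1 for r in mixed_axis for c in mixed_axis if r or c)
    return mixed / (patches_per_side ** 2)


for n in (24, 47, 48):
    print(n, round(mixed_patch_fraction(768, 16, n), 3))
```

Under these assumed settings, grids of 24 and 48 cells per side leave no patch straddling a cell border, while a grid of 47 leaves almost every patch straddling one. That is one plausible reason accuracy could differ sharply between adjacent grid sizes.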

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar controlled grids could be used to test whether other structured visual inputs, such as diagrams or maps, trigger the same encoder-to-language drop.
Training objectives that explicitly reward cell-by-cell reconstruction might close the gap without requiring larger models.
Systems that route visual features directly to an external symbolic decoder could bypass the language-side loss observed here.

Load-bearing premise

That varying grid size and number of colors raises visual complexity without semantic side effects, and that probing the visual encoder alone accurately measures what grid information is still available before the language head acts.

What would settle it

An experiment in which a model's end-to-end accuracy on grids matches or exceeds the accuracy of its isolated visual encoder probe on the same inputs, or in which accuracy remains high on grids larger than 6x6 without the reported early collapse.

Figures

Figures reproduced from arXiv: 2604.09687 by Diji Yang, Kezhen Chen, Linda Li, Xiyuan Ruan, Yingxin Cui, Yi Zhang, Yunkai Zhang, Zeyu Zheng.

**Figure 1.** Samples from G2M. Difficulty ranges from (a) simple tests to (b) dense settings that exceed standard patch resolution. For example, the 3 × 3 grid in …

**Figure 2.** Cell accuracy of open-weight models under zero-shot generation and VE probing. While zero-shot performance collapses rapidly as grid size increases, substantially higher accuracy remains recoverable from frozen VE features under supervised probing. The systemic failure across model families on grids as small as 9 × 9 indicates a severe limitation in end-to-end dense spatial transcription. However, this …

**Figure 3.** Diagnostic spatial heatmaps for proprietary models for grid sizes from 9 …

**Figure 4.** Diagnostic spatial heatmaps for open-weight models for grid sizes from 4 …

**Figure 5.** Diagnostic spatial heatmaps for open-weight model VEs (via Spatial Probing) for …

**Figure 6.** Grid-patch alignment across varying grid sizes. Performance generally peaks at …

**Figure 7.** Zero-shot scaling on open-weight models for 9 × 9 and 12 × 12 grids.

Case 1 (Qwen3-VL): LLM Scaling Offsets a Weaker VE. For Qwen3-VL, larger end-to-end models yield strictly better zero-shot accuracies (e.g., 32B drastically outperforms 8B). Spatial probing, however, reveals the opposite pattern in the VE: the smaller 300M encoder used in the 2B and 4B models retains higher spatial accuracy (81.59%) th…

**Figure 8.** Error heatmaps from spatial probes on open-weight models at 48 …

**Figure 9.** Error heatmaps comparing vanilla and extracted vision encoders at 64 …

**Figure 10.** Error heatmaps for 47 × 47, 48 × 48, 64 × 64, and 65 × 65. Yellow lines show the patch boundaries. For better visibility, we show only the center 8 × 8 patches. See Figures 14 and 15 for the full heatmaps.

[Plot: cell accuracy (%) by boundary-interaction type (Int-Int, Int-Edg, Int-Cro, Edg-Edg, Edg-Cro, Cro-Cro) for Qwen3-VL-8B-Instruct and InternVL3.5-8B.]

**Figure 11.** Cell accuracy across six boundary-interaction types near the 64 …

**Figure 12.** Color analysis heatmap illustrating per-color localization performance in …

**Figure 13.** Scaling on proprietary models on 12 × 12 and 20 × 20 grids.

… multimodal instruction tuning: in some of the training tasks that are highly structured (such as documents, charts, and GUIs), White overwhelmingly functions as negative background space. Consequently, when the LLM’s spatial tracking becomes overwhelmed by the grid density, it appears to default to prioritizing salient foreground colors like Gree…

**Figure 14.** Grid-patch interaction patterns in InternVL3.5-8B. We present error heatmaps …

**Figure 15.** Grid-patch interaction patterns in Qwen3-VL-8B-Instruct. We present error …

**Figure 16.** Zero-shot spatial error heatmaps for all proprietary models, a full version of …

**Figure 17.** Zero-shot spatial error heatmaps for Qwen3-VL-8B-Instruct across varying grid …

**Figure 18.** Zero-shot spatial error heatmaps for InternVL3.5-8B. Similar to Qwen3-VL, …

**Figure 19.** VE spatial probing heatmaps for Qwen3-VL-8B-Instruct. While the blind spots …

**Figure 20.** VE spatial probing heatmaps for InternVL3.5-8B. While the blind spots persist …

**Figure 21.** Color analysis heatmap illustrating per-color localization performance in Qwen3-…

**Figure 22.** Color analysis heatmap illustrating per-color localization performance in GPT-5-…

**Figure 23.** Color analysis heatmap illustrating per-color localization performance in Gemini …

**Figure 24.** Zero-shot spatial error heatmaps for the InternVL3.5 family across varying model …

**Figure 25.** Zero-shot spatial error heatmaps for the Qwen3-VL family. In contrast to InternVL, …

**Figure 26.** VE spatial probing heatmaps for the InternVL3.5 family across varying grid …

**Figure 27.** VE spatial probing heatmaps for the Qwen3-VL family. The 4B model uses the …

**Figure 28.** VE spatial probing heatmaps for the Qwen3-VL family, comparing the isolated …

**Figure 29.** VE spatial probing heatmaps for the InternVL3.5 family, comparing the pre …

Original abstract

Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity while minimizing semantic confounds. We find that VLMs exhibit a sharp early collapse in zero-shot end-to-end evaluation, failing on surprisingly small grids rather than degrading gradually as the task becomes denser. We probe the visual encoders of VLMs from two representative families and find that they preserve substantially more of the grid information than the corresponding end-to-end outputs. This suggests that the failure is not explained by visual encoding alone, but also reflects a gap between what remains recoverable from visual features and what is ultimately expressed in language. We term this gap Digital Agnosia. Further analyses show that these errors are highly structured and depend strongly on how grid cells overlap with visual patch boundaries. We also find that common strategies such as model scaling and multimodal alignment do not fully eliminate this failure mode. We expect G2M to serve as a useful testbed for understanding where and how VLMs lose fine visual details, and for evaluating tasks where missing even small visual details can matter, such as tables, charts, forms, and GUIs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Grid2Matrix gives a clean benchmark for VLM detail loss on grids, but the Digital Agnosia gap looks overstated because supervised probes are compared to zero-shot end-to-end outputs.

The key takeaway from this paper is that VLMs hit a wall on outputting precise matrix representations from color grids much sooner than you'd expect from gradual degradation, and the authors point to a disconnect between what the vision encoder captures and what the language head produces. They do a good job setting up Grid2Matrix as a minimal task that forces exhaustive visual readout without semantic shortcuts. Varying the grid size and color count lets them dial up complexity cleanly.

The finding that errors align with patch boundaries is concrete and ties back to how these models process images. Probing the encoders to show they retain more grid info than the full model outputs is a reasonable way to localize the issue, and they check that scaling and alignment don't solve it.

Where it gets shaky is the apples-to-oranges part of the comparison. The end-to-end results are strictly zero-shot, but the visual probes are likely supervised models trained on the features. That difference in optimization and access to labels can explain a lot of the performance gap without needing to invoke a special 'Digital Agnosia' in the language pathway. The stress-test note flags this, and the abstract doesn't clarify whether the probes are zero-shot or not. If the paper has details on probe training, that would help, but based on what's here, it undercuts the central claim a bit. The structured error patterns are still interesting on their own.

This paper is aimed at researchers studying fine-grained visual understanding in VLMs, especially for applications like parsing tables or interfaces where missing details matter. It has enough new empirical content and a useful benchmark to warrant peer review, though revisions would probably focus on making the probing protocol match the zero-shot setting more closely or clarifying the supervision levels. I'd recommend sending it out for review rather than desk rejecting.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Grid2Matrix (G2M) benchmark, in which VLMs receive a color grid plus a color-to-number mapping and must output the corresponding numeric matrix. By varying grid size and color count, the authors observe a sharp collapse in zero-shot end-to-end accuracy on surprisingly small grids. Separate probing of the visual encoders recovers substantially more grid information than the language outputs, which the authors interpret as evidence of a gap they term 'Digital Agnosia'. Errors are shown to align with patch boundaries, and neither model scaling nor multimodal alignment fully resolves the failure.

Significance. If the central empirical pattern is robust, G2M supplies a simple, low-semantic-confound testbed for diagnosing where VLMs lose fine visual structure before it reaches the language head. The structured error analysis tied to patch boundaries and the explicit comparison between end-to-end prompting and visual probing are concrete strengths that could guide future work on tables, charts, and GUI understanding.

major comments (2)

[Methods / Probing subsection] The visual-probing results rely on trained classifiers or decoders fit to the frozen visual features, while the end-to-end evaluation remains strictly zero-shot prompting. This supervised-versus-zero-shot difference supplies an alternative explanation for the reported performance gap and therefore weakens the direct attribution to an intrinsic 'Digital Agnosia' between vision and language.
[Results, zero-shot evaluation paragraph] The claim that failure occurs on 'surprisingly small grids' and does not degrade gradually requires the exact grid sizes, color counts, and accuracy curves (including confidence intervals) to be shown; without these numbers it is difficult to judge whether the collapse is as abrupt as asserted or whether it simply tracks the point at which patch-level information becomes insufficient.

minor comments (2)

[Introduction] The introduction of the term 'Digital Agnosia' would benefit from a brief discussion of how it differs from or overlaps with existing notions of visual grounding failure or detail loss already reported in the VLM literature.
[Figures] Figure captions should explicitly state the number of models, grid sizes, and color cardinalities shown in each panel so that readers can interpret the plots without returning to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will make.

Referee: [Methods / Probing subsection] The visual-probing results rely on trained classifiers or decoders fit to the frozen visual features, while the end-to-end evaluation remains strictly zero-shot prompting. This supervised-versus-zero-shot difference supplies an alternative explanation for the reported performance gap and therefore weakens the direct attribution to an intrinsic 'Digital Agnosia' between vision and language.

Authors: We acknowledge that the supervised nature of the visual probes (trained classifiers on frozen features) versus the strictly zero-shot end-to-end prompting is a methodological distinction that could contribute to the observed gap and should not be overlooked in attributing the failure to an intrinsic vision-language disconnect. Probing is intended to establish an upper bound on recoverable grid information from the visual encoder, but we agree this does not equate to a direct comparison. We will revise the Methods / Probing subsection to explicitly discuss this difference, qualify the interpretation of Digital Agnosia, and add a caveat that part of the gap may stem from the evaluation protocol rather than solely from an intrinsic bottleneck. We will also emphasize that the substantial information recovery via probing still indicates the visual features encode the grid structure beyond what zero-shot language outputs utilize.
revision: partial

Referee: [Results, zero-shot evaluation paragraph] The claim that failure occurs on 'surprisingly small grids' and does not degrade gradually requires the exact grid sizes, color counts, and accuracy curves (including confidence intervals) to be shown; without these numbers it is difficult to judge whether the collapse is as abrupt as asserted or whether it simply tracks the point at which patch-level information becomes insufficient.

Authors: We agree that explicit quantitative details are required to substantiate the claim of an abrupt collapse on small grids rather than gradual degradation. We will add a new table in the Results section listing the exact grid sizes tested, corresponding color counts, zero-shot accuracies, and 95% confidence intervals. We will also expand the zero-shot evaluation paragraph to directly reference this table and discuss the relationship to patch boundaries. This will allow readers to assess the sharpness of the performance drop and its alignment with visual patch constraints.
revision: yes
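As a rough illustration of what a 95% confidence interval on cell accuracy could look like, the sketch below computes a Wilson score interval from a count of correct cells. This is an assumed method, not one the authors state they use, and it treats every cell as an independent trial, which is a simplification because cells within the same grid are likely correlated.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 412 correct cells out of 9 x 9 x 10 = 810 evaluated cells.
low, high = wilson_interval(412, 810)
print(f"cell accuracy {412 / 810:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

A grid-level bootstrap would respect within-grid correlation better; the Wilson interval is shown only because it is compact and behaves sensibly near 0% and 100%.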

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and probing study with independent measurements.

The paper introduces the Grid2Matrix benchmark, runs zero-shot end-to-end VLM evaluations, and trains separate probes on visual encoder features to recover grid information. No equations, derivations, fitted parameters renamed as predictions, or self-citations are used to justify the central claim of a recoverable-vs-expressed gap (termed Digital Agnosia). All results follow directly from the described experimental protocol on held-out grids; the comparison between probing accuracy and zero-shot output is an external measurement rather than a self-referential reduction. Minor self-citations, if present, are not load-bearing for the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the benchmark design for isolating visual complexity and on the assumption that encoder probing faithfully reflects preserved information.

axioms (1)

domain assumption: VLMs can be meaningfully decomposed into visual encoders and language components for separate probing.
Invoked when comparing encoder outputs to end-to-end performance.

invented entities (1)

Digital Agnosia (no independent evidence)
purpose: To label the observed gap between recoverable visual features and language-expressed output.
New term coined to describe the phenomenon identified in the experiments.

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1] Reshape: We reshape the sequence into a 2D spatial feature map of size (B, D, h, w), where h = w = √L.

[2] Interpolate: Because the native feature map resolution (h × w) cannot natively match every target grid density (N × N), we apply bilinear interpolation to resize the feature map to the exact N × N dimensions.

[3] Classification: We project the features using a 1 × 1 Convolutional Head. This is mathematically equivalent to training a shared linear classifier that sweeps across and operates on every grid cell independently.

$$Y = \mathrm{Softmax}\big(\mathrm{Conv}_{1\times 1}(\mathrm{GELU}(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(X_{\mathrm{resized}}))))\big)$$

The probe head projects the input dimension to a hidden dimension of 512, followed by Batch Norm...
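The Reshape and Interpolate steps above can be sketched in plain Python for a single feature channel. This is a simplified illustration: real encoder outputs carry D channels per patch, and the half-pixel sampling convention used here is an assumption, since the excerpt does not state the interpolation settings. The function names are hypothetical.

```python
import math


def reshape_tokens(tokens):
    """Reshape L per-patch values into an h x w map with h = w = sqrt(L)."""
    side = math.isqrt(len(tokens))
    if side * side != len(tokens):
        raise ValueError("token count must be a perfect square")
    return [tokens[r * side:(r + 1) * side] for r in range(side)]


def bilinear_resize(fmap, n):
    """Resize a 2D map to n x n by bilinear interpolation (half-pixel centers)."""
    h, w = len(fmap), len(fmap[0])

    def src_coord(dst, dst_size, src_size):
        # Map an output cell center into source coordinates, clamped to range.
        x = (dst + 0.5) * src_size / dst_size - 0.5
        return min(max(x, 0.0), src_size - 1.0)

    out = []
    for i in range(n):
        y = src_coord(i, n, h)
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for j in range(n):
            x = src_coord(j, n, w)
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            # Blend horizontally on the two neighboring rows, then vertically.
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


feature_map = reshape_tokens([float(v) for v in range(16)])  # 4 x 4 map
per_cell = bilinear_resize(feature_map, 3)                   # one value per grid cell
```

After this step each of the N × N positions holds one feature vector, which is what lets a shared per-cell classifier be applied at every grid location.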
[4] Scan the grid row by row, from top to bottom.

[5] For each row, map every cell using the color mapping.

[6] Ensure the output has exactly H rows and W columns. Output Format: Return ONLY a Python list of lists (e.g., [[0, 1], [2, 0]]). Do not use markdown, code blocks, or explanations. To prevent prompt instability, the color mapping dictionary is strictly sorted by its integer values before being injected into the prompt template. We enforce fully deterministic ge...
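A prompt built from the instructions above might be assembled as follows. The instruction sentences come from the excerpt, but the overall template layout, the step numbering, the "Color mapping:" header, and the function name `build_prompt` are assumptions made for illustration. The one behavior taken directly from the text is sorting the mapping by its integer values before injection.

```python
def build_prompt(mapping, height, width):
    """Assemble a G2M-style prompt; the template layout is hypothetical."""
    # Sort by integer value so the same mapping always renders identically.
    ordered = sorted(mapping.items(), key=lambda item: item[1])
    mapping_text = ", ".join(f"{color}: {value}" for color, value in ordered)
    return (
        f"Color mapping: {mapping_text}\n"
        "1. Scan the grid row by row, from top to bottom.\n"
        "2. For each row, map every cell using the color mapping.\n"
        f"3. Ensure the output has exactly {height} rows and {width} columns.\n"
        "Output Format: Return ONLY a Python list of lists "
        "(e.g., [[0, 1], [2, 0]]). "
        "Do not use markdown, code blocks, or explanations."
    )


print(build_prompt({"blue": 2, "red": 0, "green": 1}, 3, 3))
```

Sorting matters because Python dictionaries preserve insertion order, so two runs that build the same mapping in a different order would otherwise produce different prompts.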
[7] Strict Evaluation: We first attempt to directly parse the output stream as a valid Python array using ast.literal_eval.

[8] Row-wise Regex: If standard parsing fails, we apply regular expressions to catch row-by-row structural patterns (e.g., ROW1=[...]), a format some models spontaneously adopt when overwhelmed.

[9] Fallback Flattening: As a final contingency, we extract all integers from the text response. If the total integer count perfectly matches the expected H×W cells, we sequentially reshape this 1D array into the target 2D grid dimensions. If all three extraction methods fail, the sample is recorded as a complete parse error and yields a 0% Cell Accuracy for t...
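The three extraction stages described above can be sketched as a single fallback cascade. The regular expressions, the shape checks in the first two stages, and the function name `parse_response` are assumptions; the excerpt names the stages but does not give the exact patterns.

```python
import ast
import re


def _is_int_matrix(obj, h, w):
    """True if obj is an h x w list of lists of plain ints."""
    return (
        isinstance(obj, list)
        and len(obj) == h
        and all(
            isinstance(row, list)
            and len(row) == w
            and all(type(v) is int for v in row)
            for row in obj
        )
    )


def parse_response(text, h, w):
    """Return an h x w matrix parsed from text, or None on a parse error."""
    # Stage 1: strict evaluation of the whole response as a Python literal.
    try:
        parsed = ast.literal_eval(text.strip())
        if _is_int_matrix(parsed, h, w):
            return parsed
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        pass

    # Stage 2: row-wise regex, e.g. "ROW1=[0, 1]"; takes each innermost [...].
    rows = [
        [int(v) for v in re.findall(r"-?\d+", body)]
        for body in re.findall(r"\[([^\[\]]*)\]", text)
    ]
    if len(rows) == h and all(len(row) == w for row in rows):
        return rows

    # Stage 3: fallback flattening; accept only an exact H*W integer count.
    flat = [int(v) for v in re.findall(r"-?\d+", text)]
    if len(flat) == h * w:
        return [flat[r * w:(r + 1) * w] for r in range(h)]

    # All stages failed: the caller scores this sample as 0% cell accuracy.
    return None
```

Using `ast.literal_eval` rather than `eval` keeps the strict stage safe on untrusted model output, since it only accepts Python literals.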