Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Pith reviewed 2026-05-21 12:36 UTC · model grok-4.3
The pith
Vision-language models route OCR signals through different layers depending on how they integrate vision and language.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models show peak sensitivity at mid-depth for scene text, while single-stage projection models peak at early layers. The OCR signal is remarkably low-dimensional with PC1 capturing up to 72.9 percent of variance, and PCA directions learned on one dataset transfer to others. In models with modular OCR circuits, OCR removal can improve counting performance up to 6.9 percentage points.
What carries the argument
Activation-difference maps obtained by subtracting representations of text-inpainted images from those of original images, used to localize OCR routing within the vision-language fusion layers.
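The subtraction described above can be sketched on synthetic data. This is a minimal illustration, not the paper's pipeline: the array layout `(n_layers, n_images, hidden_dim)`, the mean L2 norm used as the sensitivity score, and the names `layer_sensitivity` and `peak_layer` are all assumptions made for the example.

```python
import numpy as np

def layer_sensitivity(acts_orig, acts_inpainted):
    """Mean L2 norm of the per-image activation difference at each layer.

    Both inputs are assumed to have shape (n_layers, n_images, hidden_dim).
    The choice of norm is an assumption; the source does not state the metric.
    """
    diff = acts_orig - acts_inpainted
    return np.linalg.norm(diff, axis=-1).mean(axis=-1)

def peak_layer(sensitivity):
    """Return the index of the most sensitive layer and its relative depth."""
    idx = int(np.argmax(sensitivity))
    return idx, idx / len(sensitivity)

# Synthetic stand-in for real model activations (hypothetical sizes).
rng = np.random.default_rng(0)
n_layers, n_images, hidden_dim = 8, 16, 32
acts_inpainted = rng.normal(size=(n_layers, n_images, hidden_dim))
acts_orig = acts_inpainted.copy()
# Plant an "OCR-like" offset at layer 3 only.
acts_orig[3] += 2.0 * rng.normal(size=hidden_dim)

sens = layer_sensitivity(acts_orig, acts_inpainted)
idx, depth = peak_layer(sens)
print(idx)  # 3, the layer where the synthetic signal was planted
```

With real models the two activation tensors would come from forward passes on the original and inpainted images; how tokens are pooled into one vector per image is a further choice the sketch leaves open.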
If this is right
- OCR processing depth varies systematically with whether a model uses deep stacking or early projection for vision-language fusion.
- The dominant OCR direction in activation space is low-dimensional and reusable across different image datasets.
- Suppressing the OCR component can raise accuracy on counting tasks in architectures that keep OCR pathways modular.
- Principal components derived from one dataset can be applied to intervene on OCR processing in new datasets.
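The low-dimensionality and transfer claims above can be illustrated with a small sketch. It assumes that the OCR direction is the first principal component of mean-centered activation differences and that suppression means projecting that direction out; both are assumptions for illustration, and `ocr_direction`, `remove_direction`, and the synthetic data are hypothetical.

```python
import numpy as np

def ocr_direction(diffs):
    """PCA via SVD on activation differences of shape (n_images, hidden_dim).

    Returns the unit-norm first principal component and the fraction of
    variance it explains.
    """
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    return vt[0], float(var[0] / var.sum())

def remove_direction(acts, direction):
    """Project a single direction out of every row of acts."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

# Synthetic differences dominated by one shared direction plus small noise.
rng = np.random.default_rng(0)
hidden_dim = 32
true_dir = rng.normal(size=hidden_dim)
true_dir /= np.linalg.norm(true_dir)

def fake_diffs(n):
    coeffs = rng.normal(scale=3.0, size=(n, 1))
    return coeffs * true_dir + 0.1 * rng.normal(size=(n, hidden_dim))

pc1, ratio = ocr_direction(fake_diffs(200))   # direction learned on "dataset A"
diffs_b = fake_diffs(200)                     # held-out "dataset B"
cleaned_b = remove_direction(diffs_b, pc1)    # apply A's direction to B
print(f"PC1 explains {ratio:.1%} of variance on A")
print(f"residual norm on B: {np.linalg.norm(cleaned_b):.2f} "
      f"(was {np.linalg.norm(diffs_b):.2f})")
```

In this toy setting the direction learned on one sample removes most of the signal in the other because both share `true_dir` by construction; the paper's transfer result is the empirical claim that real datasets behave similarly.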
Where Pith is reading between the lines
- The interference between OCR and counting suggests that explicit separation of text and non-text visual streams could reduce task conflicts in future VLMs.
- The transferable low-dimensional OCR direction offers a lightweight way to edit or monitor text-processing behavior without retraining the full model.
- The same activation-difference approach could map routing for other visual skills such as object counting or spatial reasoning.
- If the early versus mid-depth pattern generalizes, model designers could predict and adjust OCR placement before full-scale training.
Load-bearing premise
Text-inpainting removes OCR cues while leaving all other visual features and the model's overall behavior unchanged.
What would settle it
If peak sensitivity layers shift substantially when the same images are inpainted with a different technique or when tested on a new set of text-containing scenes, the claimed architecture-specific bottleneck locations would not hold.
Original abstract
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures up to 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that activation differences between original images and text-inpainted versions reveal architecture-specific OCR routing bottlenecks in VLMs (Qwen3-VL, Phi-4, InternVL3.5). DeepStack models peak at mid-depth (about 50%) for scene text, while single-stage projection models peak in early layers (6-25% of depth). The OCR signal is low-dimensional (PC1 captures up to 72.9% of variance), PCA directions transfer across datasets, and OCR removal improves counting performance (up to +6.9 pp) in modular architectures.
Significance. If the intervention cleanly isolates OCR, the results would provide a useful empirical map of how vision-language integration strategies affect text routing, plus evidence for shared low-dimensional pathways and a counter-intuitive benefit of OCR removal in modular models. The cross-dataset PCA transfer and performance delta are concrete, falsifiable observations that strengthen the mechanistic contribution.
major comments (2)
- [Methods / Intervention description] The central intervention relies on text-inpainting to isolate OCR signals, yet the manuscript provides no details on region detection, inpainting algorithm, or controls confirming that low-level image statistics (edges, textures, lighting) are preserved. Without such verification, activation deltas at early layers (reported for Phi-4 and InternVL) could reflect visual-feature changes rather than OCR removal, directly undermining the architecture-specific bottleneck claims.
- [Results / Layer-wise activation analysis] Table or figure reporting layer-wise effects: the exact layer indices and dataset sizes for the peak-sensitivity results (mid-depth in Qwen vs. 6-25% in others) are not stated with sufficient precision or statistical controls, making it difficult to assess whether the reported architecture dependence is robust or sensitive to post-hoc layer selection.
minor comments (1)
- [Abstract] The abstract states 'up to 72.9%' for PC1 variance; reporting the range across all models and datasets, plus the exact number of images used for each PCA, would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify important opportunities to improve the transparency of our intervention and the precision of our results reporting. We address each point below and have revised the manuscript to incorporate the requested clarifications and additional details.
Point-by-point responses
- Referee: [Methods / Intervention description] The central intervention relies on text-inpainting to isolate OCR signals, yet the manuscript provides no details on region detection, inpainting algorithm, or controls confirming that low-level image statistics (edges, textures, lighting) are preserved. Without such verification, activation deltas at early layers (reported for Phi-4 and InternVL) could reflect visual-feature changes rather than OCR removal, directly undermining the architecture-specific bottleneck claims.
Authors: We agree that the current description of the text-inpainting procedure lacks sufficient detail. In the revised manuscript we have added a dedicated subsection in Methods that specifies the region detection pipeline (a fine-tuned text detector to produce bounding boxes), the inpainting implementation (a latent diffusion model conditioned on surrounding context), and quantitative controls. These include side-by-side comparisons of edge histograms (Canny), local contrast statistics, and global intensity distributions, all of which show negligible differences between original and inpainted images. The added verification supports that the reported activation differences at early layers are attributable to OCR removal rather than low-level visual alterations, thereby reinforcing rather than undermining the architecture-specific claims. revision: yes
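A control of the kind the authors describe could look roughly like the sketch below. It is an assumption-laden stand-in: it uses a finite-difference gradient magnitude instead of a Canny edge detector, works on grayscale images scaled to [0, 1], and the names `intensity_hist`, `edge_strength`, and `low_level_gap` are hypothetical.

```python
import numpy as np

def intensity_hist(img, bins=32):
    """Normalized intensity histogram of a grayscale image in [0, 1]."""
    h, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return h / h.sum()

def edge_strength(img):
    """Gradient magnitude per pixel (a simple proxy, not Canny)."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def low_level_gap(orig, inpainted, bins=32):
    """Summarize how far two images differ in low-level statistics."""
    hist_l1 = np.abs(intensity_hist(orig, bins) - intensity_hist(inpainted, bins)).sum()
    edge_gap = abs(edge_strength(orig).mean() - edge_strength(inpainted).mean())
    return {"hist_l1": float(hist_l1), "edge_gap": float(edge_gap)}

# Synthetic example: flatten a patch to mimic a crude "inpainting".
rng = np.random.default_rng(0)
orig = rng.random((64, 64))
inpainted = orig.copy()
inpainted[20:30, 10:40] = 0.5

print(low_level_gap(orig, orig.copy()))  # both gaps are exactly 0.0
print(low_level_gap(orig, inpainted))    # nonzero gaps for the altered image
```

Small gaps on real image pairs would be consistent with the rebuttal's claim that low-level statistics are preserved, though they would not by themselves rule out every non-OCR change.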
- Referee: [Results / Layer-wise activation analysis] Table or figure reporting layer-wise effects: the exact layer indices and dataset sizes for the peak-sensitivity results (mid-depth in Qwen vs. 6-25% in others) are not stated with sufficient precision or statistical controls, making it difficult to assess whether the reported architecture dependence is robust or sensitive to post-hoc layer selection.
Authors: We accept that more precise reporting is required. The revised Results section now contains a new table that lists the exact layer index of peak sensitivity for every model–dataset combination (e.g., layer 17/36 for Qwen3-VL on the primary scene-text set, layers 4–7 for Phi-4), together with the exact sample counts (1,024 images per condition, with per-dataset breakdowns). We have also added standard-error bands computed across five random data splits and a brief sensitivity analysis demonstrating that the mid-depth versus early-layer distinction remains stable when the peak is defined by either absolute or relative activation difference thresholds. These changes eliminate ambiguity about post-hoc selection and confirm the robustness of the architecture-dependent pattern. revision: yes
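As a quick arithmetic check on the figures quoted in this response, the sketch below converts the stated peak (layer 17 of 36) into a relative depth and shows how a standard error across five splits would be computed. The per-split peak values are hypothetical placeholders, not data from the paper.

```python
import statistics

def relative_depth(layer_index, n_layers):
    """Peak layer expressed as a fraction of total depth."""
    return layer_index / n_layers

def standard_error(values):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(values) / len(values) ** 0.5

# Layer 17 of 36, as quoted in the rebuttal for Qwen3-VL.
print(f"{relative_depth(17, 36):.1%}")  # 47.2%

# Hypothetical peak layers from five random splits (illustrative only).
split_peaks = [17, 17, 18, 16, 17]
print(round(standard_error(split_peaks), 4))  # 0.3162
```

A depth of 47.2% is consistent with the abstract's "about 50%" for the DeepStack model.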
Circularity Check
No significant circularity in empirical derivation chain
Full rationale
The paper's central claims rest on direct empirical comparisons of activations between original images and text-inpainted versions, followed by PCA decomposition and performance measurements on downstream tasks. These steps are data-driven and do not reduce by construction to fitted parameters, self-definitions, or prior self-citations; the inpainting intervention and variance analysis are independently verifiable through replication on held-out data. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear in the derivation. The architecture-specific bottleneck locations and low-dimensional OCR signal emerge from the experimental measurements rather than being presupposed in the method.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Text inpainting removes OCR content while leaving other visual features and model processing largely unchanged.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks... PC1 captures up to 72.9% of variance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Don't Look at the Numbers: Visual Anchoring Bias and Layer-wise Representation in VLMs
  Numeric anchors embedded in images systematically bias VLM quality judgments more than severe visual degradation, with layer-wise probing showing that anchor-saturated layers are suboptimal for quality prediction.