pith. sign in

arxiv: 2507.16863 · v2 · pith:PKP5SBAWnew · submitted 2025-07-21 · 💻 cs.CV · cs.AI· cs.CL

Position: Reasoning After Perception Means Reasoning Without Vision

classification 💻 cs.CV cs.AIcs.CL
keywords reasoningperceptionvisualtextitlanguagespacecollapsedfailures
0
0 comments X
read the original abstract

A common belief in multimodal research is that the perceptual weaknesses of vision--language models can be compensated by stronger language reasoning (e.g., chain-of-thought, in-context learning, or external tools). We challenge this assumption. We argue that for a broad class of visual tasks hard to specify in language, failures stem from a structural fatality where the temporal decision of \textit{when} to reason strictly dictates the spatial constraint of \textit{where} reasoning takes place. When visual reasoning is deferred to language generation, current architectures do not merely delay computation; they displace it from the continuous visual representation to a discrete textual space. Consequently, the sequential ``Perception-then-Reasoning'' paradigm degenerates perception into a passive, one-off feature encoding process, rendering it functionally equivalent to ``Reasoning-in-Text-Space'', where task-critical spatial signals are collapsed before reasoning begins. We substantiate this claim with the Turing Eye Test (TET): tasks that must be resolved in \emph{visual space} and are hard to verbalize; results show text-only reasoning cannot remedy these perceptual failures. Our findings suggest rethinking the architectural divide: shifting from reasoning \textit{about} perception to reasoning \textit{within} perception. This facilitates actively reasoning-driven perception that operates directly on pixel-level visual representations, rather than within a collapsed textual space.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.