Position: Reasoning After Perception Means Reasoning Without Vision
A common belief in multimodal research is that the perceptual weaknesses of vision-language models can be compensated for by stronger language reasoning (e.g., chain-of-thought, in-context learning, or external tools). We challenge this assumption. We argue that, for a broad class of visual tasks that are hard to specify in language, failures stem from a structural fatality in which the temporal decision of when to reason strictly dictates the spatial constraint of where reasoning takes place. When visual reasoning is deferred to language generation, current architectures do not merely delay computation; they displace it from the continuous visual representation to a discrete textual space. Consequently, the sequential "Perception-then-Reasoning" paradigm reduces perception to a passive, one-off feature-encoding process, rendering it functionally equivalent to "Reasoning-in-Text-Space", where task-critical spatial signals are collapsed before reasoning begins. We substantiate this claim with the Turing Eye Test (TET), a set of tasks that must be resolved in visual space and are hard to verbalize; the results show that text-only reasoning cannot remedy these perceptual failures. Our findings suggest rethinking the architectural divide by shifting from reasoning about perception to reasoning within perception. This shift enables active, reasoning-driven perception that operates directly on pixel-level visual representations rather than within a collapsed textual space.