SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Hang Hua; Junling Wang; Konrad Schindler; Li Mi; Mattia Rigotti; Nayanika Debnath; Niccolo Avogaro; Thomas Frick; Zexue He

arxiv: 2602.06566 · v3 · pith:3MPOGIXLnew · submitted 2026-02-06 · 💻 cs.CV · cs.AI · cs.CL

keywords: reasoning, visual, perception, sparc, perceptual, processing, regions, scaling

Abstract

Despite recent successes, test-time scaling -- i.e., dynamically expanding the token budget during inference as needed -- remains brittle for vision-language models (VLMs). Unstructured visual reasoning chains entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Reasoning also requires expensive reinforcement learning with hand-crafted rewards. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), and supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance). It also accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing visual token count and compute. SPARC outperforms monolithic baselines and strong visual-grounding approaches across challenging visual reasoning tasks, such as improving Qwen3VL 4B on the $V^*$ VQA benchmark by 6.7 points and surpassing "thinking with images" by 4.6 points in an OOD setting with a $200\times$ lower token budget.
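The two-stage flow described in the abstract (global search at reduced resolution, then reasoning over high-resolution crops of the selected regions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `two_stage_answer`, `visual_tokens`, `search`, and `reason` are hypothetical, the two stages are passed in as plain callables standing in for model calls, and the 28-pixel patch size and 4x downscale factor are assumed values chosen only to make the token arithmetic concrete.

```python
from dataclasses import dataclass
from math import ceil
from typing import Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels
Box = Tuple[int, int, int, int]

PATCH = 28  # assumption: one visual token covers a 28x28 pixel patch


def visual_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Patch-token count for an image of the given size (assumed tokenizer model)."""
    return ceil(width / patch) * ceil(height / patch)


@dataclass
class TwoStageResult:
    regions: List[Box]   # selected regions in full-resolution coordinates
    answer: str          # output of the reasoning stage
    tokens_used: int     # visual tokens: low-res global view + high-res crops


def two_stage_answer(
    image_size: Tuple[int, int],
    question: str,
    search: Callable[[Tuple[int, int], str], Sequence[Box]],
    reason: Callable[[Sequence[Box], str], str],
    downscale: int = 4,
) -> TwoStageResult:
    """Run a perception stage, then a reasoning stage conditioned on its regions."""
    width, height = image_size

    # Stage 1 (perception): search the downscaled image for relevant regions.
    low_size = (width // downscale, height // downscale)
    low_boxes = search(low_size, question)

    # Map the low-resolution boxes back to full-resolution coordinates.
    regions: List[Box] = [
        (left * downscale, top * downscale, right * downscale, bottom * downscale)
        for left, top, right, bottom in low_boxes
    ]

    # Stage 2 (reasoning): answer using only the selected high-resolution regions.
    answer = reason(regions, question)

    # Visual tokens spent: the low-res global view plus each high-res crop.
    tokens = visual_tokens(*low_size) + sum(
        visual_tokens(right - left, bottom - top)
        for left, top, right, bottom in regions
    )
    return TwoStageResult(regions, answer, tokens)
```

Under these assumed values, a 2240x1680 image processed whole would cost 4800 visual tokens, while the low-resolution pass plus a single 56x56 crop costs 304. Because `search` and `reason` are separate callables, either stage could in principle be given more compute or swapped out independently, which is the property the abstract attributes to the decoupled design.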

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
cs.CV 2026-05 conditional novelty 6.0

MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 unverdicted novelty 6.0

MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG 2026-02 conditional novelty 6.0

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.