pith. machine review for the scientific record.

arxiv: 2505.15966 · v3 · submitted 2025-05-21 · 💻 cs.CV · cs.AI · cs.CL

Recognition: 3 Lean theorem links

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 02:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords pixel-space reasoning · vision-language models · curiosity-driven reinforcement learning · visual operations · chain-of-thought · visual reasoning benchmarks · VLM training

The pith

Vision-language models can reason directly in pixel space using operations like zoom-in and frame selection to achieve new open-source highs on visual benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that vision-language models should reason directly in pixel space rather than relying solely on textual chain-of-thought. It introduces visual operations such as zoom-in and select-frame that allow models to inspect and infer from visual evidence. To train this capability, the authors use a two-phase method: instruction tuning on synthesized reasoning traces followed by reinforcement learning with a curiosity-driven reward to encourage exploration of these new operations. The result is a 7B-parameter model achieving 84 percent on V* bench, 74 percent on TallyQA-Complex, and 84 percent on InfographicsVQA. A sympathetic reader would care because it suggests that expanding reasoning beyond text can unlock better performance on tasks involving rich visual inputs like images and videos.
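To make the framework concrete, here is a minimal sketch of how visual operations like zoom-in and select-frame could be exposed to a VLM as callable tools. The function names, argument shapes, and the dispatch routine are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: pixel-space operations exposed as tools a VLM can call
# mid-reasoning. Names (zoom_in, select_frame, dispatch) are illustrative only.
from PIL import Image

def zoom_in(image: Image.Image, box: tuple[float, float, float, float]) -> Image.Image:
    """Crop a normalized (x0, y0, x1, y1) region and upsample it so the model
    can inspect fine details lost at the original input resolution."""
    w, h = image.size
    x0, y0, x1, y1 = box
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    return crop.resize((w, h))  # re-encode at full resolution for the next turn

def select_frame(video_frames: list[Image.Image], indices: list[int]) -> list[Image.Image]:
    """Pull specific frames out of a long video so the model can attend to the
    moments it believes are relevant, instead of a fixed uniform sample."""
    return [video_frames[i] for i in indices if 0 <= i < len(video_frames)]

def dispatch(op_name: str, args: dict, visual_context: dict):
    """Route a tool call emitted in the model's reasoning trace to the matching
    visual operation; the result is appended to the context for the next step."""
    if op_name == "zoom_in":
        return zoom_in(visual_context["image"], tuple(args["box"]))
    if op_name == "select_frame":
        return select_frame(visual_context["frames"], args["indices"])
    raise ValueError(f"unknown visual operation: {op_name}")
```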

Core claim

Within this novel framework, Vision-Language Models are equipped with a suite of visual reasoning operations such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence. Cultivating such pixel-space reasoning capabilities in VLMs is addressed through a two-phase training approach consisting of instruction tuning on synthesized reasoning traces followed by reinforcement learning with a curiosity-driven reward scheme to balance exploration between pixel-space and textual reasoning. This significantly improves VLM performance across diverse visual reasoning benchmarks, with the 7B model achieving 84 percent on V* bench, 74 percent on TallyQA-Complex, and 84 percent on InfographicsVQA.

What carries the argument

The two-phase training framework: instruction tuning on synthesized reasoning traces to teach pixel operations followed by curiosity-driven reinforcement learning to balance their use with textual reasoning.
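A minimal sketch of that two-phase recipe, assuming a generic tool-calling VLM trainer; the function names, trainer objects, and the simple usage bonus are placeholders for illustration, not the authors' released training code.

```python
# Hypothetical sketch of the two training phases. All APIs (model.cross_entropy,
# model.generate, rl_trainer.update) are assumed placeholders for illustration.

def phase1_instruction_tuning(model, synthesized_traces, optimizer):
    """Supervised fine-tuning on synthesized traces that interleave textual
    reasoning with visual-operation calls, so the model learns the syntax and
    typical use of zoom-in and select-frame."""
    for trace in synthesized_traces:
        loss = model.cross_entropy(trace.tokens)  # standard next-token loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def phase2_curiosity_rl(model, tasks, rl_trainer, bonus_coef=0.1):
    """RL phase: reward is task correctness plus a small bonus whenever the
    rollout actually exercised a pixel-space operation, counteracting the
    model's bias toward familiar text-only chains of thought."""
    for task in tasks:
        rollout = model.generate(task.prompt, tools=["zoom_in", "select_frame"])
        r_task = float(task.check(rollout.answer))
        used_pixel_ops = any(step.is_tool_call for step in rollout.steps)
        rl_trainer.update(rollout, reward=r_task + bonus_coef * float(used_pixel_ops))
```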

If this is right

  • VLMs can interact with complex visual inputs such as information-rich images or videos to proactively gather necessary information.
  • Performance improves significantly on diverse visual reasoning benchmarks, including V* bench, TallyQA-Complex, and InfographicsVQA.
  • The 7B open-source model reaches the highest accuracy reported for any open-source system on these tasks.
  • Pixel-space reasoning enhances reasoning fidelity for visually intensive tasks compared to text-only chain-of-thought.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curiosity mechanism could be applied to introduce other new tool sets such as audio or 3D manipulation operations in multimodal models.
  • Evaluation protocols for VLMs may need to shift toward measuring actual use of visual operations rather than final answer accuracy alone.
  • Removing the separate instruction-tuning phase might still work if the curiosity reward is strong enough from the start.

Load-bearing premise

The curiosity-driven reward scheme will successfully balance exploration of pixel-space operations with textual reasoning without the model reverting to familiar text-only strategies or exploiting the reward in unintended ways.
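One concrete way to read this premise: the curiosity bonus pays for pixel-operation use only while usage is below some target rate, so the policy cannot farm the bonus indefinitely. The decay form, target rate, and coefficients below are assumptions for illustration; the abstract does not specify the paper's actual reward.

```python
# Hypothetical curiosity shaping, illustrating the balance the premise requires.
# The target rate, decay form, and coefficients are assumed, not taken from the paper.

def curiosity_bonus(rollouts_so_far: int, pixel_op_rollouts: int,
                    target_rate: float = 0.3, coef: float = 0.2) -> float:
    """Bonus for using a pixel operation in the current rollout. It shrinks as
    the running fraction of rollouts that used pixel operations approaches the
    target rate, so the policy is nudged to explore the new operations early
    but cannot keep harvesting reward from them once usage is established."""
    observed_rate = pixel_op_rollouts / max(rollouts_so_far, 1)
    return coef * max(0.0, target_rate - observed_rate) / target_rate

def shaped_reward(answer_correct: bool, used_pixel_op: bool,
                  rollouts_so_far: int, pixel_op_rollouts: int) -> float:
    """Task reward plus the decaying curiosity term, applied only when the
    rollout actually invoked a pixel operation."""
    r_task = 1.0 if answer_correct else 0.0
    r_cur = curiosity_bonus(rollouts_so_far, pixel_op_rollouts) if used_pixel_op else 0.0
    return r_task + r_cur
```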

What would settle it

An ablation study showing that disabling the pixel operations after training produces no drop in accuracy on fine-detail visual tasks, or evidence that the model rarely invokes zoom-in or select-frame during inference, would falsify the central claim.
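A sketch of what that check could look like in practice, assuming a tool-calling evaluation harness; the method names, thresholds, and rollout fields are illustrative, not taken from the paper.

```python
# Hypothetical evaluation harness for the falsification test described above:
# compare accuracy with pixel operations enabled vs. disabled, and count how
# often the model actually invokes them. All names are illustrative.

def evaluate(model, benchmark, allow_pixel_ops: bool):
    correct, op_calls, total = 0, 0, 0
    for example in benchmark:
        tools = ["zoom_in", "select_frame"] if allow_pixel_ops else []
        rollout = model.generate(example.prompt, tools=tools)
        correct += int(example.check(rollout.answer))
        op_calls += sum(step.is_tool_call for step in rollout.steps)
        total += 1
    return correct / total, op_calls / total

def falsification_check(model, benchmark, drop_threshold=0.02, min_ops_per_item=0.1):
    acc_with, ops_per_item = evaluate(model, benchmark, allow_pixel_ops=True)
    acc_without, _ = evaluate(model, benchmark, allow_pixel_ops=False)
    # The central claim is in trouble if removing the operations costs nothing,
    # or if the trained model barely invokes them in the first place.
    return (acc_with - acc_without) < drop_threshold or ops_per_item < min_ops_per_item
```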

read the original abstract

Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidences, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces pixel-space reasoning for VLMs by adding operations such as zoom-in and select-frame that allow direct inspection of visual inputs. It uses a two-phase pipeline: instruction tuning on synthesized traces to teach the operations, followed by curiosity-driven RL to encourage their use alongside textual reasoning. The 7B Pixel Reasoner model is reported to reach 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, the highest among open-source models.

Significance. If the gains are shown to arise from sustained pixel-space operations rather than generic fine-tuning, the framework could meaningfully extend visual reasoning beyond text-only CoT by enabling proactive visual evidence gathering on complex images and videos.

major comments (3)
  1. [Methods and Experiments] No ablation compares the model after instruction tuning alone versus after the full curiosity-driven RL phase on either benchmark accuracy or pixel-operation usage frequency. Because the abstract explicitly notes the model's initial reluctance to adopt the new operations, this comparison is required to establish that the curiosity term (rather than additional SFT) is the load-bearing driver of the claimed improvements.
  2. [Results] The manuscript provides no quantitative statistics on the frequency or distribution of pixel-space operations (zoom-in, select-frame, etc.) before versus after the RL stage. Without these data it remains unclear whether the model sustains the intended pixel reasoning or reverts to text-only strategies, directly undermining the central claim that the framework induces pixel-space reasoning.
  3. [Training Details] The description of the curiosity-driven reward (how the curiosity term is computed, weighted against task reward, and prevented from exploitation) lacks sufficient implementation specifics and hyperparameter values to allow reproduction or verification that the reported gains are not artifacts of the particular reward schedule.
minor comments (2)
  1. [Abstract] The claim of 'highest accuracy achieved by any open-source model to date' should be accompanied by the exact scores of the strongest competing open-source baselines for immediate verification.
  2. [Methods] Notation: The distinction between the synthesized reasoning traces used in phase 1 and the actual model-generated traces during RL should be clarified to avoid reader confusion about data provenance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Methods and Experiments] No ablation compares the model after instruction tuning alone versus after the full curiosity-driven RL phase on either benchmark accuracy or pixel-operation usage frequency. Because the abstract explicitly notes the model's initial reluctance to adopt the new operations, this comparison is required to establish that the curiosity term (rather than additional SFT) is the load-bearing driver of the claimed improvements.

    Authors: We agree that an explicit ablation isolating the contribution of the curiosity-driven RL phase is important. In the revised manuscript we will add a direct comparison of the model after instruction tuning alone versus after the full RL stage, reporting both benchmark accuracies and pixel-operation usage frequencies. This will clarify that the observed gains are driven by the RL phase rather than additional supervised fine-tuning. revision: yes

  2. Referee: [Results] The manuscript provides no quantitative statistics on the frequency or distribution of pixel-space operations (zoom-in, select-frame, etc.) before versus after the RL stage. Without these data it remains unclear whether the model sustains the intended pixel reasoning or reverts to text-only strategies, directly undermining the central claim that the framework induces pixel-space reasoning.

    Authors: We acknowledge the need for quantitative evidence of sustained pixel-space operation usage. We will add tables and figures in the revised manuscript that report the frequency and distribution of operations such as zoom-in and select-frame before and after the RL stage. These statistics will demonstrate that the model continues to employ pixel-space reasoning rather than reverting to text-only strategies. revision: yes

  3. Referee: [Training Details] The description of the curiosity-driven reward (how the curiosity term is computed, weighted against task reward, and prevented from exploitation) lacks sufficient implementation specifics and hyperparameter values to allow reproduction or verification that the reported gains are not artifacts of the particular reward schedule.

    Authors: We will expand the training details section to provide the precise formulation of the curiosity term, its weighting relative to the task reward, any safeguards against exploitation, and all relevant hyperparameter values. These additions will enable full reproduction and verification of the reported results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical training pipeline

full rationale

The paper presents an empirical two-phase training procedure (instruction tuning on synthesized traces, followed by curiosity-driven RL) whose outputs are measured accuracies on external held-out benchmarks (V* bench, TallyQA-Complex, InfographicsVQA). No equations, predictions, or uniqueness claims are offered that reduce by construction to fitted parameters, self-citations, or ansatzes imported from the authors' prior work. The central results are direct empirical measurements rather than derived quantities that loop back to the training inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that synthesized reasoning traces provide a sufficient starting point and that a curiosity reward can overcome the model's initial bias toward text-only reasoning; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: VLMs can be effectively trained to adopt new pixel-space operations through a combination of instruction tuning and curiosity-driven RL.
    Stated as the solution to the model's initial reluctance and imbalanced competence.

pith-pipeline@v0.9.0 · 5602 in / 1118 out tokens · 35413 ms · 2026-05-14T02:17:25.171924+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We introduce the concept of pixel-space reasoning... VLMs are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame... a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA

  • Foundation.DiscretenessForcing discreteness_forcing_principle · unclear

    Relation between the paper passage and the cited Recognition theorem.

    the model’s initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations... curiosity-driven reward scheme

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  2. UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

  3. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  4. When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

    cs.AI 2026-05 conditional novelty 7.0

    State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...

  5. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  6. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  7. V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

    cs.CV 2026-03 unverdicted novelty 7.0

    V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...

  8. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  9. Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.

  10. PanoWorld: Towards Spatial Supersensing in 360° Panorama World

    cs.CV 2026-05 unverdicted novelty 6.0

    PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

  11. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR addresses information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens from LLM hidden states, extending acceptable CoT length over 30x and achieving +14.12% gains on b...

  12. Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

    cs.CV 2026-05 unverdicted novelty 6.0

    SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...

  13. Beyond Thinking: Imagining in 360° for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  14. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  15. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  16. Visual Reasoning through Tool-supervised Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    ToolsRL trains MLLMs via a tool-specific then accuracy-focused RL curriculum to master visual tools for complex reasoning tasks.

  17. AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.

  18. ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment

    cs.IR 2026-04 unverdicted novelty 6.0

    ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.

  19. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  20. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  21. Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

    cs.CV 2026-04 unverdicted novelty 5.0

    Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...

  22. Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

    cs.CV 2026-04 unverdicted novelty 5.0

    TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolutio...

  23. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  24. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  25. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.

  26. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...