Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
Pith reviewed 2026-05-15 05:10 UTC · model grok-4.3
The pith
Vision-language models improve both perception and reasoning by routing rewards to the specific source of error via blindfolded verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The root cause of the perception-reasoning trade-off is ambiguity in modality credit assignment: when a VLM fails, it is unclear whether the error stems from bad seeing or bad thinking. Decomposing generation into interleaved perception and reasoning steps, then rewarding perceptual fidelity independently through a blindfolded reasoning proxy (Perception Verification), allows Modality-Aware Credit Assignment to route credit correctly and produce joint improvements across free-form vision-language tasks.
What carries the argument
The Modality-Aware Credit Assignment (MoCA) mechanism, which integrates Perception Verification: a blindfolded reasoning proxy that assesses and rewards perception fidelity separately from reasoning outcomes.
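The verification step described above can be sketched as a minimal function. This is an illustrative reading of the abstract, not the paper's implementation: `blind_reasoner` (a text-only model call) and the exact-match reward are hypothetical stand-ins for interfaces the paper does not spell out.

```python
def perception_verification(perception_text, question, gold_answer, blind_reasoner):
    """Reward perception fidelity via 'blindfolded reasoning' (sketch).

    The proxy reasoner sees ONLY the textual perception output, never the
    image, so a correct answer suggests the perception step captured the
    task-relevant visual content.
    """
    prompt = (
        f"Observations: {perception_text}\n"
        f"Question: {question}\n"
        "Answer using only the observations above."
    )
    predicted = blind_reasoner(prompt)  # hypothetical text-only model call
    # Binary perception reward, independent of the VLM's own reasoning chain.
    return 1.0 if predicted.strip().lower() == gold_answer.strip().lower() else 0.0
```

In practice the paper's reward is presumably richer than exact-match, but the key property is visible here: the reward depends only on the perception text, never on the image or the VLM's own reasoning steps.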
If this is right
- A single VLM achieves performance lifts on both perception and reasoning benchmarks without architectural redesign or external agentic pipelines.
- Structured Verbal Verification scales the approach to open-ended vision-language tasks by replacing high-variance LLM-as-judge methods.
- Targeted credit assignment eliminates the need for heavy compute investment that previously yielded only marginal or one-sided gains.
Where Pith is reading between the lines
- The same credit-assignment logic could apply to other multimodal settings where one modality's error is hard to disentangle from another's.
- Training costs may drop further if the structured verification step replaces LLM judges at scale.
- The proxy's reliability could be stress-tested on datasets with deliberately introduced visual noise to check robustness.
Load-bearing premise
The blindfolded reasoning proxy can isolate and measure perceptual fidelity without introducing new biases or depending on the correctness of the subsequent reasoning steps.
What would settle it
An ablation that removes the blindfolded proxy or the separate perception reward shows the simultaneous gains on perception and reasoning benchmarks disappear.
Original abstract
Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the seesaw effect between perception and reasoning performance in VLMs arises from ambiguity in modality credit assignment (bad seeing vs. bad thinking). It proposes an RL framework that decomposes generation into interleaved perception and reasoning steps, introduces Perception Verification using a blindfolded reasoning proxy to reward perceptual fidelity independently of reasoning outcomes, replaces high-variance LLM judging with Structured Verbal Verification for scalability, and integrates these into a Modality-Aware Credit Assignment (MoCA) mechanism to route rewards specifically to perception or reasoning errors.
Significance. If the blindfolded proxy and MoCA successfully isolate perceptual credit without leakage or new biases, the approach could offer a lighter-weight alternative to architectural redesigns or external agentic workflows, enabling simultaneous gains across perception and reasoning tasks in a single VLM. The shift to structured algorithmic verification for free-form tasks is a practical strength that could improve reproducibility.
major comments (2)
- [Perception Verification and MoCA description] The load-bearing assumption is that the blindfolded reasoning proxy in Perception Verification can evaluate perceptual fidelity independently of reasoning outcomes. Given that VLMs use shared parameters and entangled representations, the proxy may still allow leakage of perceptual features into the reasoning path or permit the verification LLM to use non-perceptual cues, undermining clean credit routing via MoCA. This requires explicit experimental validation (e.g., controlled ablations showing separation) rather than high-level description.
- [Method overview] The decomposition into interleaved perception and reasoning steps is presented as enabling targeted supervision, but no details are given on how this decomposition is enforced during generation or training (e.g., prompting strategy, loss terms, or constraints on step boundaries). Without this, it is unclear whether the credit assignment in MoCA is unambiguous or reduces to standard RL with auxiliary signals.
minor comments (2)
- [Abstract] Define the 'seesaw effect' quantitatively with reference to prior measurements rather than using it as an informal descriptor.
- [MoCA mechanism] Clarify the exact reward formulation in MoCA, including how perception and reasoning rewards are combined and any hyperparameters involved.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below with clarifications on our method design and agree to expand the manuscript with additional details and experiments to strengthen the presentation.
Point-by-point responses
Referee: [Perception Verification and MoCA description] The load-bearing assumption is that the blindfolded reasoning proxy in Perception Verification can evaluate perceptual fidelity independently of reasoning outcomes. Given that VLMs use shared parameters and entangled representations, the proxy may still allow leakage of perceptual features into the reasoning path or permit the verification LLM to use non-perceptual cues, undermining clean credit routing via MoCA. This requires explicit experimental validation (e.g., controlled ablations showing separation) rather than high-level description.
Authors: We agree that explicit validation of separation is essential given shared parameters in VLMs. The blindfolded proxy is constructed by withholding all visual input from the reasoning component while providing only the textual perception output, ensuring it cannot access or leak perceptual features; rewards are then assigned solely based on whether the perception step enables correct downstream reasoning without visual access. To address the concern directly, we will add a dedicated ablation subsection in the revised manuscript with controlled experiments (e.g., measuring reward correlation across modalities and comparing against non-blindfolded baselines) that demonstrate clean credit routing via MoCA without measurable leakage or new biases. revision: yes
Referee: [Method overview] The decomposition into interleaved perception and reasoning steps is presented as enabling targeted supervision, but no details are given on how this decomposition is enforced during generation or training (e.g., prompting strategy, loss terms, or constraints on step boundaries). Without this, it is unclear whether the credit assignment in MoCA is unambiguous or reduces to standard RL with auxiliary signals.
Authors: The decomposition is enforced through a structured prompting template that requires the VLM to output perception steps (visual descriptions) and reasoning steps in an explicitly interleaved format with boundary markers (e.g., [PERCEPTION] and [REASONING] tags). During RL training, MoCA computes modality-specific rewards using Perception Verification for perception steps and standard outcome rewards for reasoning steps, with separate loss terms applied to each. We will revise the method section to include the exact prompting templates, step-boundary constraints, and loss formulations, clarifying that this yields unambiguous credit assignment rather than generic auxiliary signals. revision: yes
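The tagged decomposition and reward routing described in this response can be sketched as follows. The `[PERCEPTION]`/`[REASONING]` markers come from the rebuttal; the reward-combination weights are illustrative assumptions, not the paper's formulation.

```python
import re

def split_steps(generation: str):
    """Split a tagged generation into (modality, text) steps.

    Assumes the [PERCEPTION]/[REASONING] boundary markers described in the
    rebuttal; any interleaving order is accepted.
    """
    pattern = r"\[(PERCEPTION|REASONING)\](.*?)(?=\[(?:PERCEPTION|REASONING)\]|$)"
    return [(tag, body.strip()) for tag, body in re.findall(pattern, generation, re.S)]

def moca_rewards(steps, perception_reward, reasoning_reward, w_p=0.5, w_r=0.5):
    """Route modality-specific rewards to each step (illustrative weights).

    Perception steps receive the Perception Verification reward; reasoning
    steps receive the outcome reward, so credit lands on the erring modality.
    """
    return [
        w_p * perception_reward if tag == "PERCEPTION" else w_r * reasoning_reward
        for tag, _ in steps
    ]
```

For example, a rollout with faithful perception but a wrong final answer would still earn its perception steps a positive reward, which is the asymmetry the credit-routing claim rests on.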
Circularity Check
No significant circularity: external verification steps introduced independently
Full rationale
The paper's central derivation decomposes generation into perception and reasoning steps then applies Perception Verification (blindfolded reasoning proxy) plus Structured Verbal Verification inside MoCA to route modality-specific rewards. These components are defined as new external mechanisms that operate on the model's outputs rather than being defined in terms of those outputs or any fitted parameters. No equations appear in the provided text, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming reduces the claimed gains to the inputs by construction. The framework therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the blindfolded reasoning proxy isolates perceptual fidelity from reasoning quality
invented entities (1)
- Modality-Aware Credit Assignment (MoCA): no independent evidence