Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
Pith reviewed 2026-05-15 05:10 UTC · model grok-4.3
The pith
Vision-language models improve both perception and reasoning by routing rewards to the specific source of error via blindfolded verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The root cause of the perception-reasoning trade-off is ambiguity in modality credit assignment: when a VLM fails, it is unclear whether the error stems from bad seeing or bad thinking. Decomposing generation into interleaved perception and reasoning steps, then rewarding perceptual fidelity independently through a blindfolded reasoning proxy (Perception Verification), allows Modality-Aware Credit Assignment to route credit correctly and produce joint improvements across free-form vision-language tasks.
What carries the argument
The Modality-Aware Credit Assignment (MoCA) mechanism, which integrates Perception Verification: a blindfolded reasoning proxy that assesses and rewards perception fidelity separately from reasoning outcomes.
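The verification step described above can be sketched as a minimal function. This is an illustrative reading of the abstract, not the paper's implementation: `blind_reasoner` (a text-only model call) and the exact-match reward are hypothetical stand-ins for interfaces the paper does not spell out.

```python
def perception_verification(perception_text, question, gold_answer, blind_reasoner):
    """Reward perception fidelity via 'blindfolded reasoning' (sketch).

    The proxy reasoner sees ONLY the textual perception output, never the
    image, so a correct answer suggests the perception step captured the
    task-relevant visual content.
    """
    prompt = (
        f"Observations: {perception_text}\n"
        f"Question: {question}\n"
        "Answer using only the observations above."
    )
    predicted = blind_reasoner(prompt)  # hypothetical text-only model call
    # Binary perception reward, independent of the VLM's own reasoning chain.
    return 1.0 if predicted.strip().lower() == gold_answer.strip().lower() else 0.0
```

In practice the paper's reward is presumably richer than exact-match, but the key property is visible here: the reward depends only on the perception text, never on the image or the VLM's own reasoning steps.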
If this is right
- A single VLM achieves performance lifts on both perception and reasoning benchmarks without architectural redesign or external agentic pipelines.
- Structured Verbal Verification scales the approach to open-ended vision-language tasks by replacing high-variance LLM-as-judge methods.
- Targeted credit assignment eliminates the need for heavy compute investment that previously yielded only marginal or one-sided gains.
Where Pith is reading between the lines
- The same credit-assignment logic could apply to other multimodal settings where one modality's error is hard to disentangle from another's.
- Training costs may drop further if the structured verification step replaces LLM judges at scale.
- The proxy's reliability could be stress-tested on datasets with deliberately introduced visual noise to check robustness.
Load-bearing premise
The blindfolded reasoning proxy can isolate and measure perceptual fidelity without introducing new biases or depending on the correctness of the subsequent reasoning steps.
What would settle it
An ablation that removes the blindfolded proxy or the separate perception reward shows the simultaneous gains on perception and reasoning benchmarks disappear.
Original abstract
Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the seesaw effect between perception and reasoning performance in VLMs arises from ambiguity in modality credit assignment (bad seeing vs. bad thinking). It proposes an RL framework that decomposes generation into interleaved perception and reasoning steps, introduces Perception Verification using a blindfolded reasoning proxy to reward perceptual fidelity independently of reasoning outcomes, replaces high-variance LLM judging with Structured Verbal Verification for scalability, and integrates these into a Modality-Aware Credit Assignment (MoCA) mechanism to route rewards specifically to perception or reasoning errors.
Significance. If the blindfolded proxy and MoCA successfully isolate perceptual credit without leakage or new biases, the approach could offer a lighter-weight alternative to architectural redesigns or external agentic workflows, enabling simultaneous gains across perception and reasoning tasks in a single VLM. The shift to structured algorithmic verification for free-form tasks is a practical strength that could improve reproducibility.
major comments (2)
- [Perception Verification and MoCA description] The load-bearing assumption is that the blindfolded reasoning proxy in Perception Verification can evaluate perceptual fidelity independently of reasoning outcomes. Given that VLMs use shared parameters and entangled representations, the proxy may still allow leakage of perceptual features into the reasoning path or permit the verification LLM to use non-perceptual cues, undermining clean credit routing via MoCA. This requires explicit experimental validation (e.g., controlled ablations showing separation) rather than high-level description.
- [Method overview] The decomposition into interleaved perception and reasoning steps is presented as enabling targeted supervision, but no details are given on how this decomposition is enforced during generation or training (e.g., prompting strategy, loss terms, or constraints on step boundaries). Without this, it is unclear whether the credit assignment in MoCA is unambiguous or reduces to standard RL with auxiliary signals.
minor comments (2)
- [Abstract] Define the 'seesaw effect' quantitatively with reference to prior measurements rather than using it as an informal descriptor.
- [MoCA mechanism] Clarify the exact reward formulation in MoCA, including how perception and reasoning rewards are combined and any hyperparameters involved.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below with clarifications on our method design and agree to expand the manuscript with additional details and experiments to strengthen the presentation.
Point-by-point responses
Referee: [Perception Verification and MoCA description] The load-bearing assumption is that the blindfolded reasoning proxy in Perception Verification can evaluate perceptual fidelity independently of reasoning outcomes. Given that VLMs use shared parameters and entangled representations, the proxy may still allow leakage of perceptual features into the reasoning path or permit the verification LLM to use non-perceptual cues, undermining clean credit routing via MoCA. This requires explicit experimental validation (e.g., controlled ablations showing separation) rather than high-level description.
Authors: We agree that explicit validation of separation is essential given shared parameters in VLMs. The blindfolded proxy is constructed by withholding all visual input from the reasoning component while providing only the textual perception output, ensuring it cannot access or leak perceptual features; rewards are then assigned solely based on whether the perception step enables correct downstream reasoning without visual access. To address the concern directly, we will add a dedicated ablation subsection in the revised manuscript with controlled experiments (e.g., measuring reward correlation across modalities and comparing against non-blindfolded baselines) that demonstrate clean credit routing via MoCA without measurable leakage or new biases. revision: yes
Referee: [Method overview] The decomposition into interleaved perception and reasoning steps is presented as enabling targeted supervision, but no details are given on how this decomposition is enforced during generation or training (e.g., prompting strategy, loss terms, or constraints on step boundaries). Without this, it is unclear whether the credit assignment in MoCA is unambiguous or reduces to standard RL with auxiliary signals.
Authors: The decomposition is enforced through a structured prompting template that requires the VLM to output perception steps (visual descriptions) and reasoning steps in an explicitly interleaved format with boundary markers (e.g., [PERCEPTION] and [REASONING] tags). During RL training, MoCA computes modality-specific rewards using Perception Verification for perception steps and standard outcome rewards for reasoning steps, with separate loss terms applied to each. We will revise the method section to include the exact prompting templates, step-boundary constraints, and loss formulations, clarifying that this yields unambiguous credit assignment rather than generic auxiliary signals. revision: yes
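The tagged decomposition and reward routing described in this response can be sketched as follows. The `[PERCEPTION]`/`[REASONING]` markers come from the rebuttal; the reward-combination weights are illustrative assumptions, not the paper's formulation.

```python
import re

def split_steps(generation: str):
    """Split a tagged generation into (modality, text) steps.

    Assumes the [PERCEPTION]/[REASONING] boundary markers described in the
    rebuttal; any interleaving order is accepted.
    """
    pattern = r"\[(PERCEPTION|REASONING)\](.*?)(?=\[(?:PERCEPTION|REASONING)\]|$)"
    return [(tag, body.strip()) for tag, body in re.findall(pattern, generation, re.S)]

def moca_rewards(steps, perception_reward, reasoning_reward, w_p=0.5, w_r=0.5):
    """Route modality-specific rewards to each step (illustrative weights).

    Perception steps receive the Perception Verification reward; reasoning
    steps receive the outcome reward, so credit lands on the erring modality.
    """
    return [
        w_p * perception_reward if tag == "PERCEPTION" else w_r * reasoning_reward
        for tag, _ in steps
    ]
```

For example, a rollout with faithful perception but a wrong final answer would still earn its perception steps a positive reward, which is the asymmetry the credit-routing claim rests on.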
Circularity Check
No significant circularity: external verification steps introduced independently
Full rationale
The paper's central derivation decomposes generation into perception and reasoning steps then applies Perception Verification (blindfolded reasoning proxy) plus Structured Verbal Verification inside MoCA to route modality-specific rewards. These components are defined as new external mechanisms that operate on the model's outputs rather than being defined in terms of those outputs or any fitted parameters. No equations appear in the provided text, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming reduces the claimed gains to the inputs by construction. The framework therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the blindfolded reasoning proxy isolates perceptual fidelity from reasoning quality
invented entities (1)
- Modality-Aware Credit Assignment (MoCA): no independent evidence