pith. machine review for the scientific record.

arxiv: 2605.14054 · v1 · submitted 2026-05-13 · 💻 cs.AI · cs.CV

Recognition: no theorem link

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:10 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords vision-language models · reinforcement learning · perception verification · modality credit assignment · perception-reasoning synergy · blindfolded reasoning

The pith

Vision-language models improve both perception and reasoning by routing rewards to the specific source of error via blindfolded verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the seesaw effect in VLMs arises because errors cannot be clearly attributed to flawed perception or flawed logic. It decomposes model generation into separate perception and reasoning steps inside a reinforcement learning setup. A blindfolded reasoning proxy then verifies perception quality without relying on the final answer. Structured Verbal Verification replaces noisy LLM judges, and both feed into Modality-Aware Credit Assignment that delivers targeted rewards. The result is simultaneous gains on perception and reasoning tasks without external agents or heavy engineering.

Core claim

The root cause of the perception-reasoning trade-off is ambiguity in modality credit assignment; when a VLM fails it is unclear whether the error stems from bad seeing or bad thinking. Decomposing generation into interleaved perception and reasoning steps, then rewarding perceptual fidelity independently through a blindfolded reasoning proxy inside Perception Verification, allows Modality-Aware Credit Assignment to route credit correctly and produce joint improvements across free-form vision-language tasks.

What carries the argument

The Modality-Aware Credit Assignment (MoCA) mechanism, which integrates Perception Verification: a blindfolded reasoning proxy that assesses and rewards perception fidelity separately from reasoning outcomes.
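
To ground that machinery, here is a minimal sketch of the blindfolded perception reward as this review reads it. The `text_reasoner` callable, the `<recognition>` tag parsing (the tag name appears in the paper's Figure 3 description), and the binary reward scale are editorial assumptions, not the paper's implementation.

```python
import re

# Sketch of Perception Verification (PV), assuming the VLM emits its percept
# inside <recognition>...</recognition> tags and that a frozen text-only LLM
# is available as `text_reasoner`. Illustrative only.

def perception_reward(rollout: str, question: str, gold_answer: str,
                      text_reasoner) -> float:
    """Reward the perception step by whether a blindfolded (image-free)
    reasoner can solve the task from the verbalized percept alone."""
    match = re.search(r"<recognition>(.*?)</recognition>", rollout, re.DOTALL)
    if match is None:
        return 0.0                        # no explicit perception step emitted
    description = match.group(1).strip()
    prompt = (f"Question: {question}\n"
              f"Scene description: {description}\n"
              "Answer using ONLY the description above.")
    proxy_answer = text_reasoner(prompt)  # the proxy never sees pixels
    return 1.0 if proxy_answer.strip() == gold_answer.strip() else 0.0
```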

If this is right

  • A single VLM achieves performance lifts on both perception and reasoning benchmarks without architectural redesign or external agentic pipelines.
  • Structured Verbal Verification scales the approach to open-ended vision-language tasks by replacing high-variance LLM-as-judge methods (a sketch follows this list).
  • Targeted credit assignment eliminates the need for heavy compute investment that previously yielded only marginal or one-sided gains.
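
For the verification point above, a hedged sketch of what replacing a free-form judge with protocol execution could look like. The protocol wording and the `judge` interface are assumptions; the paper's actual SVV prompt is what its Figure 4 shows.

```python
# Sketch of Structured Verbal Verification (SVV): rather than asking a judge
# LLM to loosely "rate" an answer, hand it a numbered verification protocol
# and require step-by-step execution ending in PASS/FAIL. The protocol text
# and `judge` callable are illustrative assumptions.

SVV_PROTOCOL = """Execute this protocol step by step, showing each step.
1. List every factual claim in CANDIDATE.
2. For each claim, state whether REFERENCE supports, contradicts, or omits it.
3. Output 'FINAL: PASS' only if no claim is contradicted; else 'FINAL: FAIL'."""

def svv_reward(candidate: str, reference: str, judge) -> float:
    """Collapse the judge's protocol transcript into a low-variance reward."""
    transcript = judge(f"{SVV_PROTOCOL}\n\nCANDIDATE: {candidate}\n\n"
                       f"REFERENCE: {reference}")
    return 1.0 if "FINAL: PASS" in transcript else 0.0
```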

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same credit-assignment logic could apply to other multimodal settings where one modality's error is hard to disentangle from another's.
  • Training costs may drop further if the structured verification step replaces LLM judges at scale.
  • The proxy's reliability could be stress-tested on datasets with deliberately introduced visual noise to check robustness.

Load-bearing premise

The blindfolded reasoning proxy can isolate and measure perceptual fidelity without introducing new biases or depending on the correctness of the subsequent reasoning steps.

What would settle it

An ablation that removes the blindfolded proxy or the separate perception reward shows the simultaneous gains on perception and reasoning benchmarks disappear.
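
Sketched below is the shape of that experiment, with `train` and `evaluate` as assumed interfaces; this is editorial scaffolding for the ablation, not the paper's code.

```python
# Hypothetical harness for the settling ablation: train with and without the
# blindfolded perception reward and compare per-benchmark deltas. `train` and
# `evaluate` are assumed interfaces.

CONFIGS = {
    "full_moca": {"use_perception_reward": True,  "use_outcome_reward": True},
    "no_pv":     {"use_perception_reward": False, "use_outcome_reward": True},
}

def run_ablation(train, evaluate, benchmarks=("perception", "reasoning")):
    results = {}
    for name, cfg in CONFIGS.items():
        model = train(**cfg)
        results[name] = {b: evaluate(model, b) for b in benchmarks}
    # The claim survives only if full_moca beats no_pv on BOTH benchmark types.
    return results
```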

Figures

Figures reproduced from arXiv: 2605.14054 by Changpeng Wang, Chong Peng, Fangzhen Lin, Haozhe Wang, Qixin Xu, Taofeng Xue, Wenhu Chen.

Figure 1
Figure 1: Overview of MoCA. MoCA distinguishes "bad seeing" from "bad thinking" by separating Perception and Reasoning. Perception Verification (PV, top loop) uses a "blindfolded" text-reasoner proxy to reward the VLM's perception independently of its reasoning; combined with Outcome Verification (OV, bottom loop), this rewards better perception-reasoning synergy. view at source ↗
Figure 2
Figure 2: Illustration of Interleaved Perception-Reasoning. view at source ↗
Figure 3
Figure 3: Instruction to elicit interleaved perception-reasoning. view at source ↗
Figure 4
Figure 4: Instruction Prompt of Structured Verbal Verification. view at source ↗
Figure 5
Figure 5. view at source ↗
read the original abstract

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a "seesaw effect" on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the seesaw effect between perception and reasoning performance in VLMs arises from ambiguity in modality credit assignment (bad seeing vs. bad thinking). It proposes an RL framework that decomposes generation into interleaved perception and reasoning steps, introduces Perception Verification using a blindfolded reasoning proxy to reward perceptual fidelity independently of reasoning outcomes, replaces high-variance LLM judging with Structured Verbal Verification for scalability, and integrates these into a Modality-Aware Credit Assignment (MoCA) mechanism to route rewards specifically to perception or reasoning errors.

Significance. If the blindfolded proxy and MoCA successfully isolate perceptual credit without leakage or new biases, the approach could offer a lighter-weight alternative to architectural redesigns or external agentic workflows, enabling simultaneous gains across perception and reasoning tasks in a single VLM. The shift to structured algorithmic verification for free-form tasks is a practical strength that could improve reproducibility.

major comments (2)
  1. [Perception Verification and MoCA description] The load-bearing assumption is that the blindfolded reasoning proxy in Perception Verification can evaluate perceptual fidelity independently of reasoning outcomes. Given that VLMs use shared parameters and entangled representations, the proxy may still allow leakage of perceptual features into the reasoning path or permit the verification LLM to use non-perceptual cues, undermining clean credit routing via MoCA. This requires explicit experimental validation (e.g., controlled ablations showing separation) rather than high-level description.
  2. [Method overview] The decomposition into interleaved perception and reasoning steps is presented as enabling targeted supervision, but no details are given on how this decomposition is enforced during generation or training (e.g., prompting strategy, loss terms, or constraints on step boundaries). Without this, it is unclear whether the credit assignment in MoCA is unambiguous or reduces to standard RL with auxiliary signals.
minor comments (2)
  1. [Abstract] Define the 'seesaw effect' quantitatively with reference to prior measurements rather than using it as an informal descriptor.
  2. [MoCA mechanism] Clarify the exact reward formulation in MoCA, including how perception and reasoning rewards are combined and any hyperparameters involved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below with clarifications on our method design and agree to expand the manuscript with additional details and experiments to strengthen the presentation.

read point-by-point responses
  1. Referee: [Perception Verification and MoCA description] The load-bearing assumption is that the blindfolded reasoning proxy in Perception Verification can evaluate perceptual fidelity independently of reasoning outcomes. Given that VLMs use shared parameters and entangled representations, the proxy may still allow leakage of perceptual features into the reasoning path or permit the verification LLM to use non-perceptual cues, undermining clean credit routing via MoCA. This requires explicit experimental validation (e.g., controlled ablations showing separation) rather than high-level description.

    Authors: We agree that explicit validation of separation is essential given shared parameters in VLMs. The blindfolded proxy is constructed by withholding all visual input from the reasoning component while providing only the textual perception output, ensuring it cannot access or leak perceptual features; rewards are then assigned solely based on whether the perception step enables correct downstream reasoning without visual access. To address the concern directly, we will add a dedicated ablation subsection in the revised manuscript with controlled experiments (e.g., measuring reward correlation across modalities and comparing against non-blindfolded baselines) that demonstrate clean credit routing via MoCA without measurable leakage or new biases. revision: yes
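
A tiny illustration of the leakage check this response proposes: correlate the two reward streams across rollouts; a correlation near zero is consistent with, though does not prove, clean separation. The inputs below are illustrative placeholder values, not results.

```python
import statistics

def reward_correlation(perception_rewards, reasoning_rewards):
    """Pearson correlation between per-rollout perception and reasoning
    rewards; values near 0 are consistent with clean modality separation."""
    return statistics.correlation(perception_rewards, reasoning_rewards)

# Illustrative placeholder values for one RL batch (not real results).
print(reward_correlation([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]))
```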

  2. Referee: [Method overview] The decomposition into interleaved perception and reasoning steps is presented as enabling targeted supervision, but no details are given on how this decomposition is enforced during generation or training (e.g., prompting strategy, loss terms, or constraints on step boundaries). Without this, it is unclear whether the credit assignment in MoCA is unambiguous or reduces to standard RL with auxiliary signals.

    Authors: The decomposition is enforced through a structured prompting template that requires the VLM to output perception steps (visual descriptions) and reasoning steps in an explicitly interleaved format with boundary markers (e.g., [PERCEPTION] and [REASONING] tags). During RL training, MoCA computes modality-specific rewards using Perception Verification for perception steps and standard outcome rewards for reasoning steps, with separate loss terms applied to each. We will revise the method section to include the exact prompting templates, step-boundary constraints, and loss formulations, clarifying that this yields unambiguous credit assignment rather than generic auxiliary signals. revision: yes
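
As an editorial gloss on that routing, a minimal parser that attaches modality-specific rewards to tagged segments. The tag names follow the rebuttal; the regex and reward plumbing are assumptions.

```python
import re

# Sketch of modality-aware credit routing over a tagged rollout: [PERCEPTION]
# segments receive the PV reward, [REASONING] segments the outcome reward.

SEGMENT = re.compile(
    r"\[(PERCEPTION|REASONING)\](.*?)(?=\[(?:PERCEPTION|REASONING)\]|$)",
    re.DOTALL)

def route_rewards(rollout: str, r_perception: float, r_outcome: float):
    """Return (segment_text, reward) pairs, one per tagged step."""
    return [(text.strip(),
             r_perception if tag == "PERCEPTION" else r_outcome)
            for tag, text in SEGMENT.findall(rollout)]

example = ("[PERCEPTION] a red cube sits left of a blue sphere "
           "[REASONING] therefore the cube is not the rightmost object")
print(route_rewards(example, r_perception=1.0, r_outcome=0.0))
```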

Circularity Check

0 steps flagged

No significant circularity: external verification steps introduced independently

full rationale

The paper's central derivation decomposes generation into perception and reasoning steps then applies Perception Verification (blindfolded reasoning proxy) plus Structured Verbal Verification inside MoCA to route modality-specific rewards. These components are defined as new external mechanisms that operate on the model's outputs rather than being defined in terms of those outputs or any fitted parameters. No equations appear in the provided text, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming reduces the claimed gains to the inputs by construction. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

An abstract-only view limits visibility; the framework rests on the assumption that perception can be isolated and rewarded independently via proxy verification.

axioms (1)
  • domain assumption Blindfolded reasoning proxy isolates perceptual fidelity from reasoning quality
    Central to Perception Verification mechanism described in abstract.
invented entities (1)
  • Modality-Aware Credit Assignment (MoCA) no independent evidence
    purpose: Routes rewards specifically to perception or reasoning errors
    New mechanism introduced to resolve credit assignment ambiguity.

pith-pipeline@v0.9.0 · 5582 in / 1210 out tokens · 50898 ms · 2026-05-15T05:10:40.425167+00:00 · methodology

