UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Pith reviewed 2026-05-10 11:07 UTC · model grok-4.3
The pith
UniDoc-RL trains a vision-language model agent to refine visual evidence step by step from whole documents down to cropped image regions using reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space that progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, it introduces a dense multi-reward scheme that provides task-aware supervision for each action based on Group Relative Policy Optimization without a separate value network, supported by a curated dataset of high-quality reasoning trajectories with fine-grained action annotations.
What carries the argument
Hierarchical action space from coarse document retrieval to fine image cropping together with a dense multi-reward scheme that supervises every intermediate decision.
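One way to picture the coarse-to-fine action space is as a small typed hierarchy. The following is only an illustrative sketch: the names `Level`, `Action`, and `is_valid`, and the `(x1, y1, x2, y2)` box format, are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Level(Enum):
    """Granularity levels ordered coarse to fine (illustrative names)."""
    RETRIEVE = 1  # coarse: fetch candidate document pages
    SELECT = 2    # finer: pick one image among the candidates
    CROP = 3      # finest: zoom into a region of the selected image
    ANSWER = 4    # terminal: emit the final answer


@dataclass(frozen=True)
class Action:
    level: Level
    query: Optional[str] = None                       # used by RETRIEVE
    image_index: Optional[int] = None                 # used by SELECT
    bbox: Optional[Tuple[int, int, int, int]] = None  # used by CROP, assumed (x1, y1, x2, y2)
    answer: Optional[str] = None                      # used by ANSWER


def is_valid(action: Action) -> bool:
    """Check that an action carries the argument its level requires."""
    if action.level is Level.RETRIEVE:
        return bool(action.query)
    if action.level is Level.SELECT:
        return action.image_index is not None and action.image_index >= 0
    if action.level is Level.CROP:
        if action.bbox is None:
            return False
        x1, y1, x2, y2 = action.bbox
        return x1 < x2 and y1 < y2  # box must have positive width and height
    return bool(action.answer)
```

Keeping one argument per level makes it straightforward to attach a separate reward to each kind of decision, which is the role the dense multi-reward scheme is described as playing.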
If this is right
- A single agent jointly performs retrieval, reranking, active visual perception, and reasoning instead of treating them as separate stages.
- The method reports gains of up to 17.7% over prior RL-based methods across three benchmarks.
- Training succeeds by using curated trajectories that carry fine-grained action annotations rather than only final-answer labels.
- Group Relative Policy Optimization aligns the agent's behavior with several objectives at once without requiring an additional value network.
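The "no value network" point can be made concrete with the group-relative advantage that GRPO is generally understood to use: several trajectories are sampled for the same query, and each reward is standardized against the group's mean and standard deviation. This is a minimal sketch, not the paper's code; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each trajectory's reward against its sampling group.

    The group mean acts as the baseline that a learned critic would
    otherwise provide. `eps` (assumed) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Trajectories that score above their group's mean receive a positive advantage and the rest a negative one, so the policy gradient needs only sampled rewards and no separate value estimate.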
Where Pith is reading between the lines
- If the coarse-to-fine refinement pattern transfers, similar hierarchical actions could improve performance on visual question answering tasks that contain many distracting image regions.
- The framework points toward replacing modular retrieval pipelines in retrieval-augmented generation systems with a single learned policy that decides when and where to look.
- Scaling the approach would depend on finding cheaper ways to generate the required fine-grained action annotations at larger volumes.
Load-bearing premise
The dense multi-reward scheme and hierarchical action space can be designed and annotated in a way that genuinely teaches the agent to suppress irrelevant content without introducing reward hacking or dataset-specific biases that fail to generalize.
What would settle it
Run the trained agent on a fresh benchmark containing document layouts and visual noise patterns absent from the training trajectories and measure whether the reported gains over baselines shrink or disappear.
Original abstract
Retrieval-Augmented Generation (RAG) extends Large Vision-Language Models (LVLMs) with external visual knowledge. However, existing visual RAG systems typically rely on generic retrieval signals that overlook the fine-grained visual semantics essential for complex reasoning. To address this limitation, we propose UniDoc-RL, a unified reinforcement learning framework in which an LVLM agent jointly performs retrieval, reranking, active visual perception, and reasoning. UniDoc-RL formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space. Specifically, it progressively refines visual evidence from coarse-grained document retrieval to fine-grained image selection and active region cropping, allowing the model to suppress irrelevant content and attend to information-dense regions. For effective end-to-end training, we introduce a dense multi-reward scheme that provides task-aware supervision for each action. Based on Group Relative Policy Optimization (GRPO), UniDoc-RL aligns agent behavior with multiple objectives without relying on a separate value network. To support this training paradigm, we curate a comprehensive dataset of high-quality reasoning trajectories with fine-grained action annotations. Experiments on three benchmarks demonstrate that UniDoc-RL consistently surpasses state-of-the-art baselines, yielding up to 17.7% gains over prior RL-based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UniDoc-RL, a unified reinforcement learning framework for visual RAG in LVLMs. It models information acquisition as a sequential decision process with a hierarchical action space progressing from coarse document retrieval to fine-grained image selection and region cropping. Training uses a dense multi-reward scheme providing task-aware supervision at each stage, optimized via Group Relative Policy Optimization (GRPO) on a curated dataset of annotated reasoning trajectories. Experiments on three benchmarks report consistent outperformance of SOTA baselines, with gains up to 17.7% over prior RL-based methods.
Significance. If the empirical gains are robust, the work could meaningfully advance visual RAG by enabling LVLMs to actively suppress irrelevant content through hierarchical perception and multi-objective rewards, rather than relying on generic retrieval. The GRPO formulation without a separate value network and the emphasis on dense, stage-specific rewards represent a practical approach to multi-task alignment in this domain. The curated trajectory dataset is a useful contribution for future research in visual decision-making agents.
major comments (3)
- [§5] Experiments: The central claim of up to 17.7% gains over prior RL methods is presented without ablation studies isolating the contribution of each dense reward term (coarse retrieval, image selection, cropping, reasoning) or analysis of potential reward hacking. Given the absence of a value network in GRPO, this omission leaves open whether the improvements stem from genuine policy learning or from reward misspecification correlated with benchmark artifacts.
- [§4.2] Reward Design: The dense multi-reward scheme is described at a high level but lacks explicit formulation or weighting details for how task-aware supervision is computed at each hierarchical stage. Without this, it is difficult to assess whether the rewards genuinely teach suppression of irrelevant content or introduce dataset-specific biases that fail to generalize beyond the three evaluation benchmarks.
- [§5.3] Baselines and Statistics: Performance tables report point estimates of improvement but provide no error bars, multiple random seeds, or statistical significance tests. This weakens the assertion of 'consistent' surpassing of SOTA, especially since the method's sensitivity to reward shaping (noted in the GRPO setup) could amplify variance.
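The statistics requested in the last comment are simple to compute once per-seed scores exist. The sketch below, using only the Python standard library, shows a mean and sample standard deviation over seeds and the paired t statistic between two systems evaluated on the same seeds. The function names are hypothetical, and no scores from the paper are used or implied.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic for paired differences a[i] - b[i], with n - 1 degrees of freedom.

    Each index must refer to the same seed (or the same data split)
    for both systems, otherwise the pairing is meaningless.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))
```

With only a handful of seeds the t statistic has very few degrees of freedom, so a reported gain would have to be large relative to the seed-to-seed spread to count as significant.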
minor comments (2)
- [Introduction] The abstract and introduction could more clearly distinguish the proposed hierarchical action space from prior coarse-to-fine retrieval methods in the related work section.
- [§3] Notation for the action space and reward components is introduced without a dedicated table summarizing symbols, which would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, agreeing that the suggested additions will strengthen the empirical claims, and we commit to incorporating them in the revised version.
- Referee: [§5] Experiments: The central claim of up to 17.7% gains over prior RL methods is presented without ablation studies isolating the contribution of each dense reward term (coarse retrieval, image selection, cropping, reasoning) or analysis of potential reward hacking. Given the absence of a value network in GRPO, this omission leaves open whether the improvements stem from genuine policy learning or from reward misspecification correlated with benchmark artifacts.
  Authors: We acknowledge that the current manuscript does not include explicit ablations isolating each reward term or direct analysis of reward hacking. We will add a dedicated ablation subsection in §5 that systematically removes individual reward components (coarse retrieval, image selection, cropping, reasoning) and reports the resulting performance drops on all three benchmarks. We will also include qualitative trajectory analysis showing the agent's progressive suppression of irrelevant regions. Regarding GRPO, we will expand the method section to clarify that group-relative comparisons across sampled trajectories provide stable policy gradients without a critic network; the curated annotations in our trajectory dataset further anchor the rewards to ground-truth reasoning steps, reducing the risk of misspecification.
  Revision: yes
- Referee: [§4.2] Reward Design: The dense multi-reward scheme is described at a high level but lacks explicit formulation or weighting details for how task-aware supervision is computed at each hierarchical stage. Without this, it is difficult to assess whether the rewards genuinely teach suppression of irrelevant content or introduce dataset-specific biases that fail to generalize beyond the three evaluation benchmarks.
  Authors: We agree that the reward design requires more precise specification. In the revision we will provide the full mathematical definitions for each stage-specific reward (e.g., IoU-based cropping reward, relevance score for image selection, and reasoning accuracy reward), together with the exact weighting coefficients and normalization procedure. We will also add a short sensitivity study varying the weights and report performance on held-out data to address concerns about dataset-specific bias and generalization.
  Revision: yes
- Referee: [§5.3] Baselines and Statistics: Performance tables report point estimates of improvement but provide no error bars, multiple random seeds, or statistical significance tests. This weakens the assertion of 'consistent' surpassing of SOTA, especially since the method's sensitivity to reward shaping (noted in the GRPO setup) could amplify variance.
  Authors: We recognize that point estimates alone are insufficient for robust claims. We will rerun all experiments with at least three random seeds, report mean and standard deviation as error bars in the tables, and include statistical significance tests (paired t-test or Wilcoxon signed-rank) between UniDoc-RL and the strongest baselines. These additions will directly quantify variance and support the consistency of the reported gains.
  Revision: yes
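The rebuttal names an IoU-based cropping reward and per-stage weighting coefficients without giving formulas. As a hedged illustration of what such terms could look like, the sketch below computes the standard intersection-over-union of two boxes and a weighted sum of per-stage rewards. The `(x1, y1, x2, y2)` box format, the function names, the weighted-sum aggregation, and any weight values are assumptions; the paper's actual definitions are not available in this text.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def _area(box: Box) -> float:
    """Area of a box, clamped to zero for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(pred: Box, gold: Box) -> float:
    """Intersection over union of a predicted crop and an annotated region."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = _area(pred) + _area(gold) - inter
    return inter / union if union > 0 else 0.0


def combined_reward(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-stage rewards (hypothetical aggregation).

    A missing weight raises KeyError rather than silently dropping a term.
    """
    return sum(weights[name] * value for name, value in terms.items())
```

An IoU term gives partial credit for crops that overlap the annotated region, which is one plausible way to obtain the dense, stage-specific signal the referee asks to see specified.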
Circularity Check
No significant circularity; new RL framework with external benchmark evaluation
The paper presents UniDoc-RL as a novel unified RL framework that formulates visual RAG as hierarchical sequential decision-making, introduces a dense multi-reward scheme, uses GRPO for optimization, and curates a new trajectory dataset for training. Performance is measured via gains on three external benchmarks (up to 17.7% over prior RL methods). No equations or derivations are provided in the abstract that reduce predictions or results to inputs by construction. No load-bearing self-citations, self-definitional steps, or fitted inputs renamed as predictions are evident. The derivation chain is self-contained as an empirical training framework rather than tautological.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Visual information acquisition in RAG can be formulated as a sequential decision-making problem with a hierarchical action space.
- Domain assumption: Dense multi-reward signals can provide effective task-aware supervision for each action without requiring a separate value network.