Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy
Pith reviewed 2026-05-18 04:26 UTC · model grok-4.3
The pith
A two-stage hybrid system for visually rich documents recovers 99.87 percent of multi-vector Recall@1 on average across four benchmarks while cutting per-query computation by 99.82 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HEAVEN attains 99.87 percent of the Recall@1 performance of multi-vector models on average across four benchmarks while reducing per-query computation by 99.82 percent. It achieves this by first retrieving candidate pages with a single-vector method applied to Visually-Summarized Pages that combine representative visual layouts from multiple pages, then reranking those candidates with a multi-vector method that filters query tokens by linguistic importance.
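The retrieve-then-rerank shape of this claim can be made concrete with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity for the single-vector stage and a ColBERT-style MaxSim score for the multi-vector stage, neither of which this page confirms. It also leaves out VS-Page construction and query-token filtering, so stage one scores ordinary pages directly. The names `cosine`, `maxsim`, and `two_stage_search` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def maxsim(query_tokens, page_tokens):
    """Assumed late-interaction score: each query token takes its best page token."""
    return sum(max(cosine(q, p) for p in page_tokens) for q in query_tokens)

def two_stage_search(query_vec, query_tokens, pages, k):
    """pages maps page_id -> (single_vector, list_of_token_vectors)."""
    # Stage 1: cheap single-vector scoring over every page, keep the top k.
    candidates = sorted(
        pages, key=lambda pid: cosine(query_vec, pages[pid][0]), reverse=True
    )[:k]
    # Stage 2: costly multi-vector scoring, applied to the k survivors only.
    return sorted(
        candidates, key=lambda pid: maxsim(query_tokens, pages[pid][1]), reverse=True
    )

# Toy data: page "b" wins stage 1, page "a" wins after reranking, "c" never reaches stage 2.
pages = {
    "a": ([0.8, 0.6], [[1.0, 0.0], [0.0, 1.0]]),
    "b": ([1.0, 0.0], [[1.0, 0.0]]),
    "c": ([0.0, 1.0], [[0.0, 1.0]]),
}
result = two_stage_search([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], pages, k=2)
```

In this toy run the reranker can reorder `a` and `b` but cannot recover `c`, which is why first-stage recall bounds the end-to-end result.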
What carries the argument
Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages to support fast single-vector candidate retrieval in the first stage, paired with linguistic-importance filtering of query tokens during the multi-vector reranking stage.
Load-bearing premise
Visually-Summarized Pages preserve enough visual and semantic information that the single-vector first stage surfaces nearly all relevant candidates without excessive false negatives.
What would settle it
A measurement showing that the first-stage single-vector search over VS-Pages misses a substantial share of the relevant pages that a full multi-vector search would return on ViMDoc or the other three benchmarks.
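Such a measurement could be as simple as the sketch below. It assumes one possible definition of first-stage recall: a query counts as a hit if at least one relevant page appears among the top K first-stage candidates. The function name and data layout are hypothetical and are not taken from the paper or its repository.

```python
def first_stage_recall_at_k(ranked_by_query, relevant_by_query, k):
    """Fraction of queries with at least one relevant page in the first-stage top k.

    ranked_by_query:   dict query_id -> list of page ids, best first
    relevant_by_query: dict query_id -> set of relevant page ids
    """
    hits = 0
    for qid, ranked in ranked_by_query.items():
        # A query is a hit if any relevant page survives the top-k cut.
        if relevant_by_query[qid] & set(ranked[:k]):
            hits += 1
    return hits / len(ranked_by_query)

# Toy rankings for two queries (illustrative values only).
ranked = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4", "p6"]}
relevant = {"q1": {"p1"}, "q2": {"p6"}}
recall_at_2 = first_stage_recall_at_k(ranked, relevant, k=2)
recall_at_3 = first_stage_recall_at_k(ranked, relevant, k=3)
```

Under this definition the value at the chosen K is an upper bound on end-to-end Recall@1, because the reranker can only reorder the candidates it is given.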
Original abstract
Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: single-vector retrieval, which is efficient but coarse, and multi-vector retrieval, which is accurate but computationally expensive. To address this trade-off, we propose HEAVEN, a plug-and-play two-stage hybrid-vector framework. In the first stage, HEAVEN efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages. In the second stage, it reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations. To evaluate retrieval systems under realistic conditions, we also introduce ViMDoc, a benchmark for visually rich, multi-document, and long-document retrieval. Across four benchmarks, HEAVEN attains 99.87% of the Recall@1 performance of multi-vector models on average while reducing per-query computation by 99.82%, achieving efficiency and accuracy. Our code and datasets are available at: https://github.com/juyeonnn/HEAVEN
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HEAVEN, a plug-and-play two-stage hybrid-vector retrieval framework for visually rich documents. Stage one uses single-vector retrieval over Visually-Summarized Pages (VS-Pages) that aggregate representative visual layouts from multiple original pages. Stage two reranks the candidates with a multi-vector model after filtering query tokens by linguistic importance. The authors introduce the ViMDoc benchmark for multi-document and long-document retrieval and report that, across four benchmarks, HEAVEN attains 99.87% of the Recall@1 of full multi-vector models on average while reducing per-query computation by 99.82%.
Significance. If the central empirical claim holds after verification of first-stage recall, the work would provide a concrete, deployable compromise between the efficiency of single-vector and the accuracy of multi-vector retrieval for visually complex documents. The 99.82% compute reduction at near-parity Recall@1 would be practically significant for legal discovery, scientific search, and enterprise settings. The open-sourced code and the new ViMDoc benchmark are additional strengths that could facilitate follow-on research.
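To make the two headline percentages easier to read, the short sketch below converts them into more familiar quantities. It assumes that the 99.82% reduction is measured against the per-query cost of full multi-vector retrieval and that relative recall is the simple ratio of the two Recall@1 values. The recall figures passed in are illustrative placeholders, not numbers from the paper.

```python
def relative_recall(hybrid_recall, multi_vector_recall):
    """Hybrid Recall@1 as a percentage of multi-vector Recall@1."""
    return 100.0 * hybrid_recall / multi_vector_recall

def speedup_from_reduction(reduction_percent):
    """Cost ratio implied by a percentage reduction in computation."""
    return 100.0 / (100.0 - reduction_percent)

# Placeholder recalls: 0.799 against 0.800 gives 99.875% relative recall.
example_relative = relative_recall(0.799, 0.800)

# A 99.82% reduction leaves 0.18% of the original cost, roughly a 556-fold saving.
implied_speedup = speedup_from_reduction(99.82)
```

This page does not say whether the reported average is a mean of per-benchmark ratios or a ratio of mean recalls, and the two can differ slightly.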
major comments (2)
- [§4] §4 (Experimental Results): The headline claim that HEAVEN reaches 99.87% of multi-vector Recall@1 is not supported by any reported first-stage Recall@K figures for the single-vector retriever operating on VS-Pages. Because the multi-vector reranker only sees candidates that survive stage one, the absence of these metrics leaves open the possibility that the observed parity is an artifact of low false-negative rates on the chosen benchmarks rather than a general property of the VS-Page construction.
- [§3.2] §3.2 (VS-Page Construction): The description of how representative visual layouts are selected and assembled into each VS-Page does not specify the aggregation heuristic or the typical number of pages per summary. Without ablations that vary these choices and measure the resulting first-stage recall, it remains unclear whether page-specific visual cues that distinguish relevant documents are systematically lost, directly capping end-to-end performance.
minor comments (2)
- [§3] The token-filtering threshold based on linguistic importance is mentioned in the abstract and §3 but never given an explicit formula or default value; adding this detail would improve reproducibility.
- [Table 2] Table 2 (or equivalent results table) would benefit from an additional column reporting first-stage recall so readers can directly assess the contribution of each stage.
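The reproducibility gap around token filtering can be illustrated with a sketch of one form such a filter could take. It is purely hypothetical. It stands in for "linguistic importance" with a hand-picked function-word list, which is not the paper's criterion; as noted above, the manuscript gives no explicit formula or default value. The names `FUNCTION_WORDS` and `filter_query_tokens` are invented for illustration.

```python
# Hypothetical stand-in for "linguistic importance": a small function-word list.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "is", "are", "what", "which",
}

def filter_query_tokens(tokens, function_words=FUNCTION_WORDS):
    """Keep tokens that are not function words; keep everything if none survive."""
    kept = [t for t in tokens if t.lower() not in function_words]
    # Falling back avoids sending an empty query to the reranker.
    return kept or list(tokens)

query = "What is the revenue of the company in 2023".split()
kept_tokens = filter_query_tokens(query)
```

If the reranker scores each query token against every page token, dropping six of nine query tokens as in this example would cut that stage's similarity computations by about two thirds.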
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide detailed responses to each major comment and indicate the revisions we will make.
Point-by-point responses
- Referee: [§4] §4 (Experimental Results): The headline claim that HEAVEN reaches 99.87% of multi-vector Recall@1 is not supported by any reported first-stage Recall@K figures for the single-vector retriever operating on VS-Pages. Because the multi-vector reranker only sees candidates that survive stage one, the absence of these metrics leaves open the possibility that the observed parity is an artifact of low false-negative rates on the chosen benchmarks rather than a general property of the VS-Page construction.
  Authors: We agree with the referee that reporting the first-stage Recall@K is necessary to substantiate the claim and to rule out the possibility that the high performance is benchmark-specific. In the revised manuscript, we will add a new table or subsection in §4 presenting the Recall@K results for the single-vector stage on VS-Pages for various K values across the four benchmarks. These figures will show directly how many relevant pages the VS-Page stage retains.
  Revision: yes
- Referee: [§3.2] §3.2 (VS-Page Construction): The description of how representative visual layouts are selected and assembled into each VS-Page does not specify the aggregation heuristic or the typical number of pages per summary. Without ablations that vary these choices and measure the resulting first-stage recall, it remains unclear whether page-specific visual cues that distinguish relevant documents are systematically lost, directly capping end-to-end performance.
  Authors: We acknowledge that the current description in §3.2 could benefit from greater specificity regarding the aggregation heuristic and the typical number of pages per VS-Page. We will revise this section to include these details. Additionally, we will add an ablation study varying the number of pages per summary and selection choices, measuring the impact on first-stage recall to confirm that important visual cues are preserved.
  Revision: yes
Circularity Check
No circularity in empirical hybrid retrieval framework
Full rationale
The manuscript proposes HEAVEN as a plug-and-play two-stage architecture that first retrieves over constructed VS-Pages with a single-vector method and then reranks with a multi-vector method over filtered query tokens. All reported performance figures (99.87% of multi-vector Recall@1 and a 99.82% compute reduction) are obtained by direct measurement on four benchmarks, including the newly introduced ViMDoc dataset. No equations, fitted parameters, or uniqueness theorems appear in the provided text that would reduce any claimed result to the method's own inputs by construction. The framework is presented as an engineering combination of existing single- and multi-vector paradigms rather than a derivation whose central claim collapses into a self-citation chain or a renaming of known patterns. Consequently the evaluation remains externally falsifiable and the derivation chain is self-contained.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "HEAVEN consists of two stages: (Stage 1) Single-Vector Retrieval of Candidate Pages … over Visually-Summarized Pages (VS-Pages) … (Stage 2) Multi-Vector Reranking of Pages … filtering query tokens by linguistic importance"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "Across four benchmarks, HEAVEN attains 99.87% of the Recall@1 performance of multi-vector models … while reducing per-query computation by 99.82%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.