
arxiv: 2510.22215 · v2 · submitted 2025-10-25 · 💻 cs.IR · cs.CV

Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy

Pith reviewed 2026-05-18 04:26 UTC · model grok-4.3

classification 💻 cs.IR cs.CV
keywords hybrid vector retrieval · visually rich documents · document retrieval · single-vector retrieval · multi-vector retrieval · VS-Pages · ViMDoc benchmark · information retrieval

The pith

A two-stage hybrid system for visually rich documents recovers 99.87 percent of multi-vector Recall@1, averaged over four benchmarks, while cutting per-query computation by 99.82 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HEAVEN, a plug-and-play framework that first runs an efficient single-vector retriever over Visually-Summarized Pages assembled from representative layouts across multiple pages, then reranks the short list of candidates with a multi-vector method after dropping low-importance query tokens. It also releases the ViMDoc benchmark to test retrieval under realistic conditions involving long, multi-document, visually complex material. A sympathetic reader would care because many practical tasks such as legal discovery, scientific literature search, and enterprise knowledge management rely on documents whose meaning depends on layout, tables, and images. If the central claim holds, high-accuracy retrieval becomes feasible at the speed of single-vector systems rather than requiring the full cost of multi-vector processing on every query.

Core claim

HEAVEN attains 99.87 percent of the Recall@1 performance of multi-vector models on average across four benchmarks while reducing per-query computation by 99.82 percent. It achieves this by first retrieving candidate pages with a single-vector method applied to Visually-Summarized Pages that combine representative visual layouts from multiple pages, then reranking those candidates with a multi-vector method that filters query tokens by linguistic importance.
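The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`maxsim`, `hybrid_retrieve`), the dictionary layout of a VS-Page, and the toy vectors are all assumptions, and the reranker uses a generic late-interaction (MaxSim) score as a stand-in for whichever multi-vector model is plugged in.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], page_tokens: List[Vector]) -> float:
    """Late-interaction score: each query token takes its best-matching
    page token, and the per-token maxima are summed."""
    return sum(max(dot(q, p) for p in page_tokens) for q in query_tokens)


def hybrid_retrieve(
    query_vec: Vector,
    query_tokens: List[Vector],
    vs_pages: List[dict],
    page_tokens: Dict[str, List[Vector]],
    k_vs: int = 1,
    top_n: int = 1,
) -> List[str]:
    # Stage 1: rank VS-Pages with one dot product each and keep the top k_vs.
    ranked_vs = sorted(
        vs_pages, key=lambda v: dot(query_vec, v["vec"]), reverse=True
    )[:k_vs]
    # Expand the surviving VS-Pages into the original pages they summarize.
    candidates = [pid for v in ranked_vs for pid in v["pages"]]
    # Stage 2: rerank only the candidates with the multi-vector score.
    reranked = sorted(
        candidates,
        key=lambda pid: maxsim(query_tokens, page_tokens[pid]),
        reverse=True,
    )
    return reranked[:top_n]


# Toy data (hypothetical): two VS-Pages, each summarizing two pages.
vs_pages = [
    {"vec": [1.0, 0.0], "pages": ["a1", "a2"]},
    {"vec": [0.0, 1.0], "pages": ["b1", "b2"]},
]
page_tokens = {
    "a1": [[1.0, 0.0], [0.5, 0.5]],
    "a2": [[0.9, 0.1], [0.2, 0.1]],
    "b1": [[0.0, 1.0]],
    "b2": [[0.1, 0.9]],
}
query_vec = [0.9, 0.1]
query_tokens = [[1.0, 0.0], [0.6, 0.4]]

top_pages = hybrid_retrieve(
    query_vec, query_tokens, vs_pages, page_tokens, k_vs=1, top_n=2
)
```

The point of the structure is that the expensive `maxsim` call runs only on pages reachable from the top `k_vs` VS-Pages, so any relevant page whose VS-Page is dropped in stage 1 can never be recovered in stage 2.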

What carries the argument

Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages to support fast single-vector candidate retrieval in the first stage, paired with linguistic-importance filtering of query tokens during the multi-vector reranking stage.
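As a rough illustration of the query-side filtering, the sketch below drops low-importance tokens before reranking. The importance criterion here is an assumption: the text above does not give the paper's formula, so a small hand-written stopword list stands in for it, and both `STOPWORDS` and `filter_query_tokens` are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical stand-in for "linguistic importance": a tiny stopword list.
# The paper's actual criterion is not specified in this review.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "for", "to", "is", "are",
     "what", "which", "and", "or", "by", "do", "does"}
)


def filter_query_tokens(
    tokens: List[str], token_vectors: List[List[float]]
) -> Tuple[List[str], List[List[float]]]:
    """Keep only tokens judged important, together with their vectors."""
    kept = [
        (tok, vec)
        for tok, vec in zip(tokens, token_vectors)
        if tok.lower() not in STOPWORDS
    ]
    if not kept:
        # Never drop the whole query; fall back to all tokens.
        kept = list(zip(tokens, token_vectors))
    return [tok for tok, _ in kept], [vec for _, vec in kept]


tokens = ["What", "is", "the", "revenue", "in", "2023"]
vectors = [[float(i)] for i in range(len(tokens))]  # placeholder embeddings
kept_tokens, kept_vectors = filter_query_tokens(tokens, vectors)
```

Because a late-interaction score costs one pass over the page tokens per query token, dropping four of six query tokens in this toy query would cut the reranking work for each candidate page to a third.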

Load-bearing premise

Visually-Summarized Pages preserve enough visual and semantic information that the single-vector first stage surfaces nearly all relevant candidates without excessive false negatives.

What would settle it

A measurement showing that the first-stage single-vector search over VS-Pages misses a substantial share of the relevant pages that a full multi-vector search would return on ViMDoc or the other three benchmarks.
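The measurement in question is straightforward to compute once first-stage candidate lists are logged. The sketch below assumes Recall@K means the fraction of queries with at least one relevant page among the top K first-stage candidates; that definition, the function name, and the toy data are illustrative assumptions rather than the paper's evaluation code.

```python
from typing import Dict, List, Set


def first_stage_recall_at_k(
    ranked_candidates: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries whose top-k first-stage candidates contain
    at least one relevant page."""
    if not relevant:
        return 0.0
    hits = 0
    for query_id, gold_pages in relevant.items():
        top_k = ranked_candidates.get(query_id, [])[:k]
        if any(page_id in gold_pages for page_id in top_k):
            hits += 1
    return hits / len(relevant)


# Toy example (hypothetical page ids): q3's relevant page never surfaces.
ranked = {
    "q1": ["p3", "p1", "p9"],
    "q2": ["p4", "p7", "p2"],
    "q3": ["p5", "p6", "p8"],
}
gold = {"q1": {"p1"}, "q2": {"p2"}, "q3": {"p0"}}

recall_by_k = {k: first_stage_recall_at_k(ranked, gold, k) for k in (1, 2, 3)}
```

First-stage recall at the candidate budget K is an upper bound on what the reranker can achieve, so a value well below the full multi-vector system's recall would be direct evidence against the load-bearing premise.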

Original abstract

Retrieval over visually rich documents is essential for tasks such as legal discovery, scientific search, and enterprise knowledge management. Existing approaches fall into two paradigms: single-vector retrieval, which is efficient but coarse, and multi-vector retrieval, which is accurate but computationally expensive. To address this trade-off, we propose HEAVEN, a plug-and-play two-stage hybrid-vector framework. In the first stage, HEAVEN efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages (VS-Pages), which assemble representative visual layouts from multiple pages. In the second stage, it reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations. To evaluate retrieval systems under realistic conditions, we also introduce ViMDoc, a benchmark for visually rich, multi-document, and long-document retrieval. Across four benchmarks, HEAVEN attains 99.87% of the Recall@1 performance of multi-vector models on average while reducing per-query computation by 99.82%, achieving efficiency and accuracy. Our code and datasets are available at: https://github.com/juyeonnn/HEAVEN

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes HEAVEN, a plug-and-play two-stage hybrid-vector retrieval framework for visually rich documents. Stage one uses single-vector retrieval over Visually-Summarized Pages (VS-Pages) that aggregate representative visual layouts from multiple original pages. Stage two reranks the candidates with a multi-vector model after filtering query tokens by linguistic importance. The authors introduce the ViMDoc benchmark for multi-document and long-document retrieval and report that, across four benchmarks, HEAVEN attains 99.87% of the Recall@1 of full multi-vector models on average while reducing per-query computation by 99.82%.

Significance. If the central empirical claim holds after verification of first-stage recall, the work would provide a concrete, deployable compromise between the efficiency of single-vector and the accuracy of multi-vector retrieval for visually complex documents. The 99.82% compute reduction at near-parity Recall@1 would be practically significant for legal discovery, scientific search, and enterprise settings. The open-sourced code and the new ViMDoc benchmark are additional strengths that could facilitate follow-on research.
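To see where a reduction of this size could come from, a back-of-the-envelope cost model helps. The sketch below counts vector dot products only, under a simplified model that is an assumption of this example; every number in it is invented for illustration and none is taken from the paper.

```python
def multi_vector_cost(num_pages: int, query_tokens: int, page_tokens: int) -> int:
    """Dot products for full late-interaction scoring of every page."""
    return num_pages * query_tokens * page_tokens


def hybrid_cost(
    num_vs_pages: int,
    num_candidates: int,
    kept_query_tokens: int,
    page_tokens: int,
) -> int:
    """One dot product per VS-Page in stage 1, plus late-interaction
    scoring of the candidates with the filtered query in stage 2."""
    return num_vs_pages + num_candidates * kept_query_tokens * page_tokens


def reduction_percent(baseline: int, hybrid: int) -> float:
    """Percentage of baseline computation saved by the hybrid pipeline."""
    return 100.0 * (1.0 - hybrid / baseline)


# Illustrative numbers only (not from the paper).
baseline = multi_vector_cost(num_pages=100_000, query_tokens=20, page_tokens=1_000)
hybrid = hybrid_cost(
    num_vs_pages=10_000, num_candidates=100, kept_query_tokens=10, page_tokens=1_000
)
saved = reduction_percent(baseline, hybrid)
```

Under this model the saving is dominated by the ratio of reranked candidates to corpus pages, with query-token filtering contributing a further constant factor; the stage 1 term is negligible unless the number of VS-Pages approaches the reranking budget.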

major comments (2)
  1. [§4] §4 (Experimental Results): The headline claim that HEAVEN reaches 99.87% of multi-vector Recall@1 is not supported by any reported first-stage Recall@K figures for the single-vector retriever operating on VS-Pages. Because the multi-vector reranker only sees candidates that survive stage one, the absence of these metrics leaves open the possibility that the observed parity is an artifact of low false-negative rates on the chosen benchmarks rather than a general property of the VS-Page construction.
  2. [§3.2] §3.2 (VS-Page Construction): The description of how representative visual layouts are selected and assembled into each VS-Page does not specify the aggregation heuristic or the typical number of pages per summary. Without ablations that vary these choices and measure the resulting first-stage recall, it remains unclear whether page-specific visual cues that distinguish relevant documents are systematically lost, directly capping end-to-end performance.
minor comments (2)
  1. [§3] The token-filtering threshold based on linguistic importance is mentioned in the abstract and §3 but never given an explicit formula or default value; adding this detail would improve reproducibility.
  2. [Table 2] Table 2 (or equivalent results table) would benefit from an additional column reporting first-stage recall so readers can directly assess the contribution of each stage.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. Below, we provide detailed responses to each major comment and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [§4] §4 (Experimental Results): The headline claim that HEAVEN reaches 99.87% of multi-vector Recall@1 is not supported by any reported first-stage Recall@K figures for the single-vector retriever operating on VS-Pages. Because the multi-vector reranker only sees candidates that survive stage one, the absence of these metrics leaves open the possibility that the observed parity is an artifact of low false-negative rates on the chosen benchmarks rather than a general property of the VS-Page construction.

    Authors: We agree with the referee that reporting first-stage Recall@K is necessary to substantiate the claim and to rule out the possibility that the high performance is benchmark-specific. In the revised manuscript, we will add a table or subsection in §4 presenting Recall@K for the single-vector stage on VS-Pages at several values of K across the four benchmarks. This will show how much recall the VS-Page first stage retains before reranking. Revision: yes

  2. Referee: [§3.2] §3.2 (VS-Page Construction): The description of how representative visual layouts are selected and assembled into each VS-Page does not specify the aggregation heuristic or the typical number of pages per summary. Without ablations that vary these choices and measure the resulting first-stage recall, it remains unclear whether page-specific visual cues that distinguish relevant documents are systematically lost, directly capping end-to-end performance.

    Authors: We acknowledge that the current description in §3.2 needs greater specificity regarding the aggregation heuristic and the typical number of pages per VS-Page. We will revise this section to include these details. Additionally, we will add an ablation study that varies the number of pages per summary and the layout-selection choices, and measures the impact on first-stage recall, to test whether important visual cues are preserved. Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical hybrid retrieval framework


The manuscript proposes HEAVEN as a plug-and-play two-stage architecture that first retrieves over constructed VS-Pages with single-vector methods and then reranks with filtered multi-vector methods. All reported performance figures (99.87% Recall@1 parity and 99.82% compute reduction) are obtained by direct measurement on four benchmarks, including the newly introduced ViMDoc. No equations, fitted parameters, or uniqueness theorems appear in the provided text that would reduce any claimed result to the method's own inputs by construction. The framework is presented as an engineering combination of existing single- and multi-vector paradigms rather than a derivation whose central claim collapses into a self-citation chain or a renaming of known patterns. Consequently, the evaluation remains externally falsifiable and the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central performance claim rests on the assumption that VS-Pages retain sufficient visual signal and that token filtering removes only redundant computation; no explicit free parameters or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5749 in / 1250 out tokens · 28779 ms · 2026-05-18T04:26:19.146765+00:00 · methodology

