Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval

Yao Zhang; Yu Xiao; Zhuchenyang Liu; Ziyu Hu

arXiv: 2601.20107 · v2 · pith:B2WWDVP4new · submitted 2026-01-27 · 💻 cs.CV · cs.CL · cs.IR

Pith reviewed 2026-05-22 11:10 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR

keywords visual document retrieval · multi-vector pruning · training-free compression · structural anchor pruning · score retention · vision-language models · token pruning · structural plateau

The pith

Structural Anchor Pruning retains over 90% of retrieval quality (NDCG@5) after pruning more than 90% of the visual tokens in document search models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Structural Anchor Pruning to compress multi-vector indexes used in visual document retrieval by vision-language models. It achieves this in a training-free and query-agnostic manner by diagnosing layer-wise score retention, selecting a suitable pruning window, and scoring central anchor patches. A sympathetic reader would care because this reduces the prohibitive storage costs of fine-grained retrieval without sacrificing much accuracy or requiring per-model training. The approach works across models with different layer counts by exploiting a stable structural region in the network.

Core claim

The central claim is that Structural Anchor Pruning can prune more than 90% of visual tokens while retaining over 90% of NDCG@5 on the ViDoRe benchmarks for three different architectures. It does so by locating the Structural Plateau, where visual structure is preserved, and using in-degree centrality to select anchors there, rather than pruning in the final layers, where representations are query-aligned.

What carries the argument

SR-guided window selection and the visual in-degree centrality scorer together identify the pruning region and the key patches to keep, maintaining retrieval performance without training.
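
The centrality half of that machinery can be illustrated with a minimal sketch. It assumes a square attention matrix `attn[i][j]` holding the weight that visual token `i` places on visual token `j`, and treats in-degree as the total attention a token receives. The function names, the toy matrix, and the keep ratio are hypothetical; this summary does not state how the paper aggregates heads or layers, so this is not the authors' scorer.

```python
def in_degree_centrality(attn):
    """Column sums of a square attention matrix: attention received per token."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


def select_anchors(attn, keep_ratio):
    """Return the sorted indices of the top `keep_ratio` tokens by in-degree."""
    scores = in_degree_centrality(attn)
    k = max(1, round(keep_ratio * len(scores)))  # always keep at least one token
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical 4-token attention pattern in which token 1 is attended to most.
toy_attn = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.1, 0.6, 0.1, 0.2],
]
anchors = select_anchors(toy_attn, keep_ratio=0.25)
```

Because the scores depend only on the document's own attention pattern, a selection like this can run once at indexing time with no query in hand.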

If this is right

Index storage for visual document retrieval becomes feasible at much larger scales due to high compression ratios.
The pruning can be performed once at indexing time independently of any queries.
The technique generalizes to vision-language models with varying backbone depths without hyperparameter tuning per model.
Analysis of layer-wise behavior shows why pruning in final layers fails but intermediate structural regions succeed.
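
The storage implication in the first point above can be made concrete with back-of-the-envelope arithmetic. Every number in this sketch is a hypothetical assumption chosen for illustration (one million pages, 1024 patch vectors per page, 128 dimensions, 2 bytes per value); none of them comes from the paper.

```python
def index_bytes(num_pages, vectors_per_page, dim, bytes_per_value):
    """Raw size of a multi-vector index, ignoring metadata and ANN overhead."""
    return num_pages * vectors_per_page * dim * bytes_per_value


# Hypothetical corpus and embedding shape (assumptions, not paper figures).
NUM_PAGES = 1_000_000
VECTORS_PER_PAGE = 1024
DIM = 128
BYTES_PER_VALUE = 2  # e.g. 16-bit floats

kept_per_page = int(VECTORS_PER_PAGE * 0.10)  # keep roughly 10% of the vectors

full = index_bytes(NUM_PAGES, VECTORS_PER_PAGE, DIM, BYTES_PER_VALUE)
pruned = index_bytes(NUM_PAGES, kept_per_page, DIM, BYTES_PER_VALUE)

print(f"full index:   {full / 1e9:.1f} GB")
print(f"pruned index: {pruned / 1e9:.1f} GB")
```

Under these assumptions the raw index shrinks by about a factor of ten, which is the scale of saving the compression ratio implies.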

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This pruning strategy could be adapted to other retrieval tasks involving multi-vector representations.
The observed divergence between structure preservation and query alignment might inform compression methods in related vision-language applications.
Testing on even larger models or different document types would help confirm the generality of the structural plateau.

Load-bearing premise

The Score Retention metric and window selection procedure can identify an appropriate structural pruning region for any given backbone architecture without manual tuning or additional training.

What would settle it

Apply the method to a previously untested vision-language model backbone and measure whether NDCG@5 drops below 90% of the unpruned score when more than 90% of tokens are pruned.
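
A minimal sketch of that check follows, assuming NDCG with linear gain and a log2 rank discount (other gain conventions exist, and the benchmark's official evaluator may differ). The helper `claim_survives` and the toy rankings are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain over the top-k ranks."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """`relevances` holds the graded labels of all judged documents, in ranked order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def claim_survives(ndcg_full, ndcg_pruned, pruned_fraction):
    """True if retention exceeds 90% while more than 90% of tokens are pruned."""
    if ndcg_full <= 0:
        raise ValueError("unpruned NDCG must be positive to compute retention")
    retention = ndcg_pruned / ndcg_full
    return pruned_fraction > 0.9 and retention > 0.9


# Toy query: the single relevant page is ranked first before pruning, second after.
full = ndcg_at_k([1, 0, 0, 0, 0, 0])
pruned = ndcg_at_k([0, 1, 0, 0, 0, 0])
survives = claim_survives(full, pruned, pruned_fraction=0.92)
```

In practice the two NDCG@5 values would be averaged over all benchmark queries before the retention ratio is taken.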

Original abstract

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable "Structural Plateau" within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
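
To see why pruning document vectors can leave retrieval scores largely intact, here is a minimal sketch that assumes the MaxSim late-interaction scoring used by ColBERT-style multi-vector retrievers. The toy vectors and the `prune` helper are hypothetical, and the ratio of pruned to full score computed here is only an illustration, not the paper's Score Retention definition, which this page does not give.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: each query vector takes its best-matching
    document vector, and the per-query maxima are summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def prune(doc_vecs, keep_indices):
    """Keep only the document vectors at the given indices."""
    return [doc_vecs[i] for i in keep_indices]


# Hypothetical 2-d embeddings: two query tokens and three page patches.
query = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

full_score = maxsim(query, page)
good_prune = maxsim(query, prune(page, [0, 1]))  # keeps both best matches
bad_prune = maxsim(query, prune(page, [2]))      # drops both best matches
```

Only the vectors that win some query's maximum contribute to the score, so dropping the rest costs nothing for that query; the difficulty is choosing them without seeing the query.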

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SAP gives a training-free way to prune most visual tokens in document retrieval models by finding a safe middle-layer window, with solid benchmark numbers but open questions on how automatic the window choice really is.

The main thing to know is that Structural Anchor Pruning cuts more than 90 percent of visual tokens while keeping over 90 percent of NDCG@5 on ViDoRe v1/v2 across three different backbone sizes, all without training or per-model tuning. It does this by using Score Retention to locate a structural plateau in the layers and then picking anchor patches with in-degree centrality inside that window.

The paper also offers a clear explanation for why final-layer pruning fails: later layers reshape the representation into a query-aligned form that loses the stable visual structure needed for safe compression. This Alignment-Aggregation Divergence is a useful mechanistic observation that prior heuristic methods did not articulate as directly.

The results are the strongest part. Showing consistent behavior on 18-, 28-, and 36-layer models without any knobs adjusted per architecture is practical and directly addresses storage costs in visual document retrieval. The approach challenges the view that high-compression pruning must be query-dependent or trained.

The soft spot is the SR-guided window selection. The claim that it automatically finds the right region for any backbone with zero hyperparameters rests on the procedure working for the three tested models, but the abstract does not spell out the exact detection rules or thresholds. If any part of the logic was chosen after inspecting those specific backbones, the training-free and generalizable guarantee would need extra evidence on unseen architectures. This is worth checking in the full experiments and ablations.

The paper is aimed at engineers and researchers who need to shrink multi-vector indexes for large-scale visual document search without retraining models. Readers focused on efficiency techniques in vision-language retrieval would get concrete ideas they can test.

It deserves peer review because the empirical gains are relevant and the layer analysis adds insight, even if the automaticity claim needs tighter validation.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Structural Anchor Pruning (SAP), a training-free and query-agnostic framework for compressing multi-vector indexes in visual document retrieval models. It comprises three components: Score Retention (SR) as a white-box per-layer diagnostic, SR-guided window selection to automatically identify a 'structural pruning region' (the 'Structural Plateau') without per-model hyperparameters, and a visual in-degree centrality scorer to select anchor patches within that window. On ViDoRe v1/v2 benchmarks across three architectures (18-, 28-, and 36-layer backbones), SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens. The work also reports a layer-resolved analysis revealing an 'Alignment-Aggregation Divergence' that explains why intermediate layers preserve structure better than final layers.

Significance. If the central claims hold under verification, the result would be significant for practical deployment of fine-grained VDR systems, as it offers a storage-efficient alternative to query-dependent training while maintaining high retrieval quality. The mechanistic insight into layer-wise behavior and the parameter-free design across multiple backbones could influence compression strategies in vision-language retrieval more broadly.

major comments (2)

[Abstract and method overview] The core claim of being training-free and generalizable rests on SR-guided window selection automatically locating the structural pruning region for arbitrary backbones with zero per-model hyperparameters. However, the manuscript provides no formal definition of window boundaries, no explicit fixed thresholds or detection rules for SR drop-off, and no ablation demonstrating that the identical procedure succeeds on an unseen backbone without retuning. This is load-bearing for the 'no per-model parameter tuning' guarantee reported in the abstract.
[Experiments and results] The abstract asserts >90% NDCG@5 retention with >90% token pruning across the three tested architectures, but the provided details lack full experimental tables, per-model breakdowns, baseline comparisons, or robustness checks on the SR diagnostic. Without these, it is difficult to confirm that the performance holds beyond the reported conditions or that the SR metric is stable enough to support the window-selection procedure.

minor comments (1)

The newly introduced terms 'Structural Anchor' and 'Alignment-Aggregation Divergence' would benefit from explicit mathematical or algorithmic definitions early in the text to improve clarity for readers unfamiliar with the internal model computations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments help us improve the clarity and rigor of our presentation of Structural Anchor Pruning. We respond point-by-point to the major comments below.

Referee: [Abstract and method overview] The core claim of being training-free and generalizable rests on SR-guided window selection automatically locating the structural pruning region for arbitrary backbones with zero per-model hyperparameters. However, the manuscript provides no formal definition of window boundaries, no explicit fixed thresholds or detection rules for SR drop-off, and no ablation demonstrating that the identical procedure succeeds on an unseen backbone without retuning. This is load-bearing for the 'no per-model parameter tuning' guarantee reported in the abstract.

Authors: We appreciate the referee's focus on formalizing the window-selection procedure, which is central to our generalizability claim. Section 3.2 describes how SR-guided window selection identifies the Structural Plateau by locating the longest contiguous region of high SR before a pronounced drop-off, using a uniform, fixed detection rule applied identically to all backbones. To strengthen this, the revised manuscript will add a precise mathematical definition of the window boundaries (as the maximal interval [l, r] satisfying a fixed relative-change threshold on the SR sequence) together with pseudocode. The same fixed rule is used without retuning across the 18-, 28-, and 36-layer models, which already constitute three architecturally distinct backbones; we will expand the text to explicitly state that this uniform application constitutes the evidence for the parameter-free guarantee. While an experiment on a fourth, completely unseen backbone would provide additional support, the existing multi-backbone results demonstrate the procedure's robustness without per-model adjustments.
revision: partial
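
The rule described in this response, a maximal interval [l, r] satisfying a fixed relative-change threshold on the SR sequence, could look roughly like the sketch below. It is one possible reading only: the threshold value `tau`, the function name, and the toy SR curve are hypothetical, and the sketch omits any floor on how high SR must be inside the window, since neither detail is specified on this page.

```python
def longest_stable_window(sr, tau=0.05):
    """Return (l, r), the inclusive layer indices of the longest run in which
    every adjacent relative change |sr[i] - sr[i-1]| / |sr[i-1]| is <= tau.
    Ties are resolved in favour of the earliest run."""
    best = (0, 0)
    start = 0
    for i in range(1, len(sr)):
        change = abs(sr[i] - sr[i - 1]) / max(abs(sr[i - 1]), 1e-12)
        if change > tau:
            start = i  # a jump or drop-off breaks the run; restart here
        if i - start > best[1] - best[0]:
            best = (start, i)
    return best


# Hypothetical per-layer SR curve: a rise, a flat plateau, then a drop-off.
toy_sr = [0.40, 0.70, 0.90, 0.91, 0.92, 0.91, 0.60, 0.30]
window = longest_stable_window(toy_sr)
```

Even this small sketch carries one fixed constant (`tau`), which is the kind of detail the referee asks the authors to state and ablate.
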
Referee: [Experiments and results] The abstract asserts >90% NDCG@5 retention with >90% token pruning across the three tested architectures, but the provided details lack full experimental tables, per-model breakdowns, baseline comparisons, or robustness checks on the SR diagnostic. Without these, it is difficult to confirm that the performance holds beyond the reported conditions or that the SR metric is stable enough to support the window-selection procedure.

Authors: We thank the referee for noting the need for more granular experimental reporting. The manuscript already presents per-architecture NDCG@5 and pruning ratios together with baseline comparisons in the main results section and figures; however, we agree that fuller documentation will aid verification. In the revision we will move detailed per-model tables to the main text or a prominent appendix, add further baseline methods, and include explicit robustness analyses of the SR diagnostic (layer-wise variance, sensitivity to the fixed drop-off threshold, and stability across the three backbones). These additions will directly substantiate both the performance claims and the reliability of SR for automatic window selection.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; method relies on internal model diagnostics and empirical validation

The paper's core procedure defines Score Retention (SR) directly from per-layer model computations on the input document tokens and uses an SR-guided window selection to identify a 'Structural Plateau' based on observed stability patterns across layers. This is not equivalent to its inputs by construction, nor does it rename a fitted parameter as a prediction. The generalization claim across 18/28/36-layer backbones is supported by reported benchmark results on ViDoRe v1/v2 rather than by self-definition or a self-citation chain that bears the load. The derivation chain remains self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the existence of a stable structural plateau identifiable without per-model tuning and the validity of the in-degree centrality measure for selecting anchors.

axioms (1)

domain assumption: The document's visual structure is preserved as a stable Structural Plateau within the backbone layers before final query-aligned reshaping.
Invoked in the abstract's layer-resolved SR analysis as the mechanistic reason for SAP's success.

invented entities (2)

Structural Anchor · no independent evidence
purpose: Key patches identified via visual in-degree centrality within the selected pruning window.
New concept introduced to guide token pruning decisions.

Alignment-Aggregation Divergence · no independent evidence
purpose: Explains the difference between stable visual structure in middle layers and sparse query-aligned form in final layers.
Mechanistic insight derived from SR analysis to justify pruning region choice.

pith-pipeline@v0.9.0 · 5796 in / 1496 out tokens · 60980 ms · 2026-05-22T11:10:58.107360+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

We define the layer ensemble $L^*$ as a function of the model's total depth $L_{\text{total}}$ and relative depth hyperparameters $\alpha, \beta \in [0, 1]$: $L^*(\alpha, \beta) = \{\, l \in \mathbb{N} \mid \lfloor \alpha \cdot L_{\text{total}} \rfloor \le l \le \lfloor \beta \cdot L_{\text{total}} \rfloor \,\}$

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.

SAP identifies semantic structural anchor patches by measuring the visual In-Degree Centrality of tokens within the Large Language Model (LLM) backbone
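
The layer-ensemble definition quoted in the first passage above can be written out directly. This is a sketch of the set definition only; the function name is hypothetical, and whether layer indices start at 0 or 1 is an assumption the quoted passage does not settle.

```python
import math


def layer_ensemble(total_layers, alpha, beta):
    """Layers l with floor(alpha * L_total) <= l <= floor(beta * L_total)."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    lo = math.floor(alpha * total_layers)
    hi = math.floor(beta * total_layers)
    return list(range(lo, hi + 1))  # empty when the lower bound exceeds the upper


# Hypothetical relative depths applied to a 28-layer backbone.
window_28 = layer_ensemble(28, alpha=0.25, beta=0.5)
```

Expressing the window through relative depths is what lets a single rule be applied to backbones of different depths, such as the 18-, 28-, and 36-layer models in the paper.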

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.