arxiv: 2601.21853 · v2 · pith:5GNRWCVDnew · submitted 2026-01-29 · 💻 cs.IR · cs.LG

LEMUR: Learned Multi-Vector Retrieval

Pith reviewed 2026-05-22 11:51 UTC · model grok-4.3

keywords: multi-vector retrieval, late interaction models, MaxSim, similarity search, neural network, information retrieval, latency reduction, ColBERT

The pith

LEMUR reduces multi-vector retrieval to single-vector search using a learned one-hidden-layer network.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-vector representations from late interaction models like ColBERT improve retrieval quality over single-vector methods but incur high latency from computing MaxSim similarity across token embeddings. LEMUR addresses this by first casting multi-vector similarity search as a supervised learning problem solved with a one-hidden-layer neural network. It then reduces inference under the learned model to single-vector similarity search inside the network's latent space. This lets existing single-vector indexes accelerate retrieval; the abstract reports an order-of-magnitude speedup over prior multi-vector similarity search methods. A sympathetic reader cares because the approach could make high-quality multi-vector retrieval practical at scale without custom hardware or slow exhaustive search.
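To make the latency argument concrete, here is a minimal sketch of MaxSim in plain Python. It assumes inner-product similarity between token embeddings; the function names and the toy vectors are illustrative and are not taken from the paper's code. Scoring one document costs one inner product per (query token, document token) pair, which is where the overhead relative to a single-vector score comes from.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """MaxSim: for each query token take its best-matching document token
    (largest inner product), then sum these maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy example (made-up numbers): 2 query tokens, 3 document tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0], [0.0, 0.25]]

score = maxsim(query, document)
print(score)  # 1.0 (first query token) + 0.5 (second query token) = 1.5
```

An exhaustive search repeats this computation for every document in the corpus, so the work grows with corpus size times the number of query tokens times the number of document tokens.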

Core claim

LEMUR formulates multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network, then reduces inference under this model to single-vector similarity search in its latent space, enabling the use of existing single-vector search indexes to accelerate retrieval.

What carries the argument

A one-hidden-layer neural network trained on MaxSim targets whose latent space turns the multi-vector problem into ordinary single-vector similarity search.
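The reduction can be illustrated with a short sketch. It rests on assumptions that this page does not confirm: a ReLU hidden layer ψ(x) = ReLU(W1 x + b1), sum pooling to obtain Ψ(X), and a linear output layer with one weight vector per document and no bias. All names and numbers are hypothetical. Under these assumptions, summing per-token outputs over the query tokens equals a single inner product between the document's output weights and the pooled vector, which is the quantity a single-vector index can search.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hidden(x: Vector, W1: List[Vector], b1: Vector) -> List[float]:
    """Hypothetical hidden layer psi(x) = ReLU(W1 x + b1) (assumed activation)."""
    return [max(0.0, dot(row, x) + b) for row, b in zip(W1, b1)]


def pooled(query_tokens: List[Vector], W1: List[Vector], b1: Vector) -> List[float]:
    """Psi(X): sum of psi(x) over the query tokens (sum pooling is assumed)."""
    latent = [hidden(x, W1, b1) for x in query_tokens]
    return [sum(column) for column in zip(*latent)]


def score_per_token(query_tokens, w_doc, W1, b1) -> float:
    """Network output for one document, summed token by token."""
    return sum(dot(w_doc, hidden(x, W1, b1)) for x in query_tokens)


def score_pooled(query_tokens, w_doc, W1, b1) -> float:
    """Same score as one inner product with the pooled latent vector."""
    return dot(w_doc, pooled(query_tokens, W1, b1))


# Made-up weights: input dimension 2, hidden dimension 3.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 0.5]
w_doc = [0.5, 0.25, 1.0]          # output weights of one hypothetical document
query = [[1.0, 0.0], [0.0, 1.0]]  # two query token embeddings

print(pooled(query, W1, b1))                  # [1.0, 1.0, 1.5]
print(score_per_token(query, w_doc, W1, b1))  # 2.25
print(score_pooled(query, w_doc, W1, b1))     # 2.25
```

Because the output layer is linear, the identity holds for any hidden activation; the approximation error relative to true MaxSim comes entirely from how well the trained network fits its targets.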

Load-bearing premise

The one-hidden-layer network and its latent-space reduction preserve retrieval quality comparable to direct MaxSim computation on the original token embeddings.

What would settle it

Side-by-side measurement of recall or nDCG and query latency on a standard benchmark such as MS MARCO, for LEMUR versus an exact MaxSim baseline, would show whether quality holds while latency drops by roughly a factor of ten.
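One way to set up such a comparison is sketched below, assuming that exact MaxSim scores and approximate scores are available per document. The dictionaries are made-up stand-ins, not results from the paper, and the helper names are hypothetical. The sketch measures how much of the exact top-k survives in a candidate set of size k′ chosen by the approximate score, and then reranks those candidates with the exact score.

```python
from typing import Dict, List


def top_ids(scores: Dict[str, float], k: int) -> List[str]:
    """Ids of the k highest-scoring documents (ties broken by id)."""
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]


def candidate_recall(exact: Dict[str, float], approx: Dict[str, float],
                     k: int, k_prime: int) -> float:
    """Share of the exact top-k found among the k' candidates
    ranked by the approximate score."""
    truth = set(top_ids(exact, k))
    candidates = set(top_ids(approx, k_prime))
    return len(truth & candidates) / len(truth)


def retrieve_then_rerank(exact: Dict[str, float], approx: Dict[str, float],
                         k: int, k_prime: int) -> List[str]:
    """Pick k' candidates by approximate score, reorder them by the exact
    score, and keep the best k."""
    candidates = top_ids(approx, k_prime)
    reranked = sorted(candidates, key=lambda doc_id: (-exact[doc_id], doc_id))
    return reranked[:k]


# Made-up scores for five documents (illustration only, not paper results).
exact_scores = {"d1": 3.0, "d2": 2.5, "d3": 2.0, "d4": 1.0, "d5": 0.5}
approx_scores = {"d1": 2.8, "d2": 1.1, "d3": 2.4, "d4": 1.9, "d5": 0.2}

for k_prime in (3, 4):
    print(k_prime,
          candidate_recall(exact_scores, approx_scores, 2, k_prime),
          retrieve_then_rerank(exact_scores, approx_scores, 2, k_prime))
# 3 0.5 ['d1', 'd3']
# 4 1.0 ['d1', 'd2']
```

In a real system the exact scores would be computed only for the k′ candidates; the full dictionary is used here just to keep the example short. Wrapping each stage in a timer such as `time.perf_counter` would give the latency side of the comparison.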

Figures

Figures reproduced from arXiv: 2601.21853 by Elias Jääsaari, Teemu Roos, Ville Hyvönen.

Figure 1: A schematic overview of the query process (for indexing, see Sec. 3) in the LEMUR framework: The latent representations ψ(x) of the token-level embeddings x ∈ X are retrieved from the hidden layer of an MLP trained to estimate the MaxSim similarities between a query and each document. The single-vector representation Ψ(X) is obtained by pooling these latent representations. The k′ most similar documents t…
Figure 2: Ablation study on the effect of the hidden layer size d′ on the performance of LEMUR. Left: Comparison of recall100@k′ for three values of d′ as a function of the candidate set size k′. Right: Comparison of end-to-end performance for different values of d′, with ANNS and reranking included. While larger values of d′ can yield more accurate estimates, the end-to-end performance gap …
Figure 3: Ablation study on the effect of using ANNS in LEMUR. ANNS significantly speeds up retrieval at recall levels < 0.95.
Adjacent body text: … (see Appendix B), the differences in performance are small. Hence, we use d′ = 2048 in our end-to-end experiments to decrease the memory consumption. ANNS library vs. exact inference. Finally, we study the effect of using an ANNS index instead of evaluating the exact inner products (6) to …
Figure 4: End-to-end performance comparison using ColBERTv2 embeddings on the HotpotQA (left) and MS MARCO (right) datasets. On both datasets, LEMUR is significantly faster than the baseline methods.
Adjacent body text: … Sec. 4. As a single-vector similarity search library, we use Glass (Wang, 2025), an efficient implementation of HNSW (Malkov & Yashunin, 2018) with scalar quantization. We implement the MaxSim reranking using C++. Basel…
Figure 5: End-to-end performance comparison using four different modern multi-vector text models on the SCIDOCS dataset. On all datasets, LEMUR is significantly faster than the baseline methods, while especially MUVERA struggles on the non-ColBERTv2 models.
Figure 6: End-to-end performance comparison using two visual document retrieval models on the ViDoRe dataset. LEMUR yields state-of-the-art performance compared to the baselines, but the gap is narrower than for the text models.
Adjacent body text: … footprint and leads to lower end-to-end latency due to faster single-vector similarity search. Standard scalar or product quantization (Jegou et al., 2011) techniques could be applied to th…
Original abstract

Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding per token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved quality of multi-vector retrieval comes at the expense of significantly increased search latency. In this work, we introduce LEMUR, a simple yet efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: First, we formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, enabling the use of existing single-vector search indexes to accelerate retrieval. LEMUR is an order of magnitude faster than prior multi-vector similarity search methods. Our code is available at https://github.com/ejaasaari/lemur

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LEMUR, a framework that reduces multi-vector similarity search (e.g., MaxSim over token embeddings as in ColBERT) to a supervised learning problem solved by a one-hidden-layer neural network, followed by a second reduction that maps inference to single-vector similarity search in the network's latent space. This enables reuse of existing single-vector indexes and yields an order-of-magnitude latency improvement over prior multi-vector methods while aiming to preserve retrieval quality.

Significance. If the empirical results confirm that the learned reduction preserves document rankings at a level comparable to direct MaxSim, the work would provide a practical bridge between high-quality late-interaction models and efficient single-vector infrastructure. The availability of code at the cited GitHub repository is a positive contribution to reproducibility.

major comments (2)
  1. [§3] §3 (Method, second reduction): The claim that single-vector search in the latent space produces rankings equivalent to the original MaxSim computation is load-bearing for the speedup result. The manuscript should report a quantitative measure of ranking agreement (e.g., Kendall-tau or fraction of queries with identical top-k ordering) between LEMUR and direct MaxSim on the same queries and corpus.
  2. [§4] §4 (Experiments): The reported latency gains are measured against prior multi-vector baselines; however, without side-by-side effectiveness numbers (MRR@10, Recall@1000, or nDCG) on standard benchmarks such as MS MARCO or BEIR, it is impossible to verify that the operating point remains comparable to the original multi-vector system.
minor comments (2)
  1. [Abstract / §3.1] The abstract states that the approach is 'simple yet efficient' but does not define the training objective or loss used for the one-hidden-layer network; this notation should be introduced in §3.1 with an equation.
  2. [Figure 1] Figure 1 (or equivalent diagram) would benefit from explicit arrows showing the flow from token embeddings through the hidden layer to the latent single-vector representation.
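The ranking-agreement measures named in major comment 1 can be sketched in plain Python. This is an illustrative implementation of Kendall's tau-a (tied pairs count as neither concordant nor discordant) and of the fraction of queries with identical top-k ordering; the function names and toy inputs are hypothetical and the numbers are not from the paper.

```python
from itertools import combinations
from typing import List, Sequence


def kendall_tau(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Kendall tau-a between two score lists over the same documents:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def identical_topk_fraction(rankings_a: List[List[str]],
                            rankings_b: List[List[str]], k: int) -> float:
    """Fraction of queries whose top-k lists match exactly, order included."""
    pairs = list(zip(rankings_a, rankings_b))
    same = sum(1 for a, b in pairs if a[:k] == b[:k])
    return same / len(pairs)


# Made-up scores of four documents under exact MaxSim and an approximation.
exact = [3.0, 2.5, 2.0, 1.0]
approx = [2.8, 1.1, 2.4, 1.9]
print(kendall_tau(exact, approx))  # 4 concordant, 2 discordant -> 2/6

# Made-up per-query rankings from two systems (two queries).
system_a = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
system_b = [["d1", "d2", "d4"], ["d5", "d4", "d6"]]
print(identical_topk_fraction(system_a, system_b, 2))  # 0.5
```

Tau would typically be averaged over queries and computed on a fixed candidate pool, since computing it over a whole corpus is quadratic in the number of documents with this naive loop.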

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of empirical validation that will strengthen the presentation of LEMUR. We address each major comment below and commit to incorporating the suggested analyses in the revised version.

Point-by-point responses
  1. Referee: [§3] §3 (Method, second reduction): The claim that single-vector search in the latent space produces rankings equivalent to the original MaxSim computation is load-bearing for the speedup result. The manuscript should report a quantitative measure of ranking agreement (e.g., Kendall-tau or fraction of queries with identical top-k ordering) between LEMUR and direct MaxSim on the same queries and corpus.

    Authors: We agree that a quantitative measure of ranking agreement would strengthen the paper. To clarify, the second reduction is constructed to be exactly equivalent to inference under the learned one-hidden-layer neural network (i.e., the approximation to MaxSim), rather than claiming exact equivalence to the original MaxSim computation itself. Nevertheless, we will add a new analysis in the revised manuscript that reports Kendall's tau and the fraction of queries with identical top-k orderings between LEMUR's latent-space search and direct MaxSim on the evaluation sets. This will allow readers to assess the end-to-end fidelity of the overall approach. revision: yes

  2. Referee: [§4] §4 (Experiments): The reported latency gains are measured against prior multi-vector baselines; however, without side-by-side effectiveness numbers (MRR@10, Recall@1000, or nDCG) on standard benchmarks such as MS MARCO or BEIR, it is impossible to verify that the operating point remains comparable to the original multi-vector system.

    Authors: We thank the referee for this observation. While the current experiments already compare LEMUR against multi-vector baselines on both latency and retrieval quality, we will revise the experimental section to include explicit side-by-side tables reporting MRR@10, Recall@1000, and nDCG@10 for LEMUR, the original MaxSim implementation, and other baselines on MS MARCO and BEIR. This will make the comparability of the operating point fully transparent. revision: yes
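The effectiveness metrics promised in this response can be sketched per query as follows, assuming binary relevance judgments. The function names and the toy ranking are hypothetical; averaging the per-query values over all queries gives MRR@10, Recall@k, and nDCG@10 for a system.

```python
import math
from typing import List, Set


def reciprocal_rank_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant document within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        raise ValueError("query has no relevant documents")
    return len(set(ranking[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranking: List[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG with binary gains: DCG of the top k divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranking[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0


# Made-up ranking and judgments for a single query.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(reciprocal_rank_at_k(ranking, relevant))  # first relevant at rank 2 -> 0.5
print(recall_at_k(ranking, relevant, 3))        # only d1 in the top 3 -> 0.5
print(ndcg_at_k(ranking, relevant))
```

Running the same functions on rankings from LEMUR and from exact MaxSim over identical queries and judgments would give the side-by-side table the referee asks for.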

Circularity Check

0 steps flagged

No significant circularity; reductions are modeling choices with external leverage

full rationale

The paper presents LEMUR as two explicit problem reductions: casting multi-vector MaxSim search as a supervised one-hidden-layer network learning task, followed by mapping inference to single-vector search in the resulting latent space to reuse existing indexes. These steps are algorithmic reformulations and modeling decisions trained on external data, not self-referential definitions or fitted parameters renamed as predictions. The claimed order-of-magnitude speedup derives from leveraging independent single-vector indexes rather than any loop back to the paper's own inputs or self-citations. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the derivation chain.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Abstract-only view reveals one main fitted component (neural network weights) and the assumption that the learned model approximates MaxSim without loss of quality; no explicit axioms or new entities are stated.

free parameters (1)
  • one-hidden-layer neural network parameters
    Weights of the network trained to solve the supervised formulation of multi-vector similarity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Paper passage: "We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 4 internal anchors

  1. [1] Amini, A., Banaszak, A., Benoit, H., Böök, A., Dakhran, T., Duong, S., Eng, A., Fernandes, F., Härkönen, M., Harrington, A., et al. LFM2 technical report. arXiv preprint arXiv:2511.23404.

  2. [2] Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450.

  3. [3] Clavié, B., Chaffin, A., and Adams, G. Reducing the footprint of multi-vector retrieval with minimal performance impact via token pooling. arXiv preprint arXiv:2409.14683.

  4. [4] Günther, M., Sturua, S., Akram, M. K., Mohr, I., Ungureanu, A., Wang, B., Eslami, S., Martens, S., Werk, M., Wang, N., et al. jina-embeddings-v4: Universal embeddings for multimodal multilingual retrieval. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pp. 531–550.

  5. [5] Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.

  6. [6] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T., and Aumüller, M. VIBE: Vector index benchmark for embeddings. arXiv preprint arXiv:2505.17810.

  7. [7] Jha, R., Wang, B., Günther, M., Mastrapas, G., Sturua, S., Mohr, I., Koukounas, A., Wang, M. K., Wang, N., and Xiao, H. Jina-ColBERT-v2: A general-purpose multilingual late interaction retriever. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pp. 159–166. Association for Computational Linguistics.

  8. [8] Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  9. [9] Loison, A., Macé, Q., Edy, A., Xing, V., Balough, T., Moreira, G., Liu, B., Faysse, M., Hudelot, C., and Viaud, G. ViDoRe V3: A comprehensive evaluation of retrieval augmented generation in complex real-world scenarios. arXiv preprint arXiv:2601.08620.

  10. [10] Santhanam, K., Khattab, O., Potts, C., and Zaharia, M. PLAID: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 1747–1756, 2022a.
      Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. ColBERTv2: Effective and efficient retrieval via light...

  11. [11] URL https://github.com/lightonai/fast-plaid. Takehi, R., Clavié, B., Lee, S., and Shakir, A. Fantastic (small) retrievers and how to train them: mxbai-edge-colbert-v0 tech report. arXiv preprint arXiv:2510.14880.

  12. [12] Teiletche, P., Macé, Q., Conti, M., Loison, A., Viaud, G., Colombo, P., and Faysse, M. ModernVBERT: Towards smaller visual document retrievers. arXiv preprint arXiv:2510.01149.

  13. [13] Veneroso, J., Jayaram, R., Rao, J., Ábrego, G. H., Hadian, M., and Cer, D. CRISP: Clustering multi-vector representations for denoising and pruning. arXiv preprint arXiv:2505.11471.

  14. [14] URL https://github.com/zilliztech/pyglass. Xu, M., Moreira, G., Ak, R., Osmulski, R., Babakhin, Y., Yu, Z., Schifferer, B., and Oldridge, E. Llama nemoretriever colembed: Top-performing text-image retrieval model. arXiv preprint arXiv:2507.05513.

  15. [15] Yan, Y., Xu, G., Zou, X., Liu, S., Kwok, J., and Hu, X. DocPruner: A storage-efficient framework for multi-vector visual document retrieval via adaptive patch-level embedding pruning. arXiv preprint arXiv:2509.23883.

  16. [16] [Plot residue, not a reference] "For discussion, see Sec. 6.2." Plots of recall100@k′ against the number of candidates k′ (250 to 2000) on msmarco-colbert and hotpotqa-colbert, with series LEMUR 4096, LEMUR 2048, LEMUR 1024, and MUVERA 10240 ...

  17. [17] [Plot residue, not a reference] Heading "End-to-end performance additional results C.1". Plots of recall10@k′ against the number of candidates k′ (0 to 500) on msmarco-colbert, hotpotqa-colbert, nq-colbert, quo..., with series LEMUR 4096, LEMUR 2048, LEMUR 1024, and MUVERA 10240.