LEMUR: Learned Multi-Vector Retrieval
Pith reviewed 2026-05-22 11:51 UTC · model grok-4.3
The pith
LEMUR reduces multi-vector retrieval to single-vector search using a learned one-hidden-layer network.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LEMUR formulates multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network, then reduces inference under this model to single-vector similarity search in its latent space, enabling the use of existing single-vector search indexes to accelerate retrieval.
What carries the argument
A one-hidden-layer neural network trained on MaxSim targets whose latent space turns the multi-vector problem into ordinary single-vector similarity search.
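One way to read the second reduction, offered here as an assumption rather than the paper's stated construction: if the network scores a document by summing, over query tokens, the inner product between a per-document output weight vector and that token's hidden activations, then the sum can be moved inside the inner product. The query collapses to one pooled latent vector and each document to one weight vector, which is the setting ordinary single-vector indexes handle. The sketch below checks that identity on random, untrained weights. ReLU, the layer sizes, and every function name are illustrative choices, not taken from the paper.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def hidden(token, W, b):
    """Hidden-layer activations for one query token embedding."""
    return [relu(sum(w * t for w, t in zip(row, token)) + bias)
            for row, bias in zip(W, b)]


def score_per_token(query_tokens, W, b, doc_vec):
    """Model score for one document, accumulated token by token."""
    return sum(sum(h * v for h, v in zip(hidden(q, W, b), doc_vec))
               for q in query_tokens)


def pooled_query(query_tokens, W, b):
    """Single latent query vector: hidden activations summed over tokens."""
    pooled = [0.0] * len(b)
    for q in query_tokens:
        for i, h in enumerate(hidden(q, W, b)):
            pooled[i] += h
    return pooled


# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = random.Random(0)
dim, width, n_docs = 4, 8, 5
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
b = [rng.uniform(-1, 1) for _ in range(width)]
doc_vecs = [[rng.uniform(-1, 1) for _ in range(width)] for _ in range(n_docs)]
query = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]

z = pooled_query(query, W, b)
per_token = [score_per_token(query, W, b, v) for v in doc_vecs]
single_vec = [sum(a * c for a, c in zip(z, v)) for v in doc_vecs]
```

Under this reading, `per_token` and `single_vec` agree up to floating-point rounding, so a maximum-inner-product index over `doc_vecs` queried with `z` would return the model's own ranking. Whether that ranking tracks MaxSim is a separate, empirical question.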
Load-bearing premise
The one-hidden-layer network and its latent-space reduction preserve retrieval quality comparable to direct MaxSim computation on the original token embeddings.
What would settle it
Side-by-side measurement of recall or NDCG and query latency on a standard benchmark such as MS MARCO for LEMUR versus an exact MaxSim baseline would show whether quality holds while latency drops by roughly ten times.
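The quality half of that comparison can be sketched as a recall computation: how much of the exact MaxSim top-k survives among the first k′ candidates returned by the approximate method. The function name, document ids, and rankings below are made up for illustration and are not results from the paper.

```python
def recall_at_k(exact_ranking, approx_ranking, k, k_prime):
    """Fraction of the exact top-k found among the first k_prime approximate candidates."""
    truth = set(exact_ranking[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ranking[:k_prime])
    return len(found) / len(truth)


# Toy rankings (hypothetical document ids).
exact = ["d3", "d1", "d7", "d2", "d9"]
approx = ["d1", "d3", "d2", "d8", "d7", "d9"]

r_small = recall_at_k(exact, approx, k=3, k_prime=3)
r_large = recall_at_k(exact, approx, k=3, k_prime=5)
```

Averaging this value over a query set, and recording wall-clock latency at each k′, would give the quality-versus-latency curve the comparison calls for.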
Original abstract
Multi-vector representations generated by late interaction models, such as ColBERT, enable superior retrieval quality compared to single-vector representations in information retrieval applications. In multi-vector retrieval systems, both queries and documents are encoded using one embedding per token, and similarity between queries and documents is measured by the MaxSim similarity measure. However, the improved quality of multi-vector retrieval comes at the expense of significantly increased search latency. In this work, we introduce LEMUR, a simple yet efficient framework for multi-vector similarity search. LEMUR consists of two consecutive problem reductions: First, we formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space, enabling the use of existing single-vector search indexes to accelerate retrieval. LEMUR is an order of magnitude faster than prior multi-vector similarity search methods. Our code is available at https://github.com/ejaasaari/lemur
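For readers unfamiliar with MaxSim, a minimal sketch follows. It uses the standard late-interaction definition: each query token embedding is matched with its best-scoring document token embedding, and the best scores are summed. The toy embeddings and function names are illustrative; a real system would use batched tensor operations rather than Python lists.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """For each query token, take the best inner product against any
    document token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-dimensional token embeddings (made-up values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.5]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

The cost is one inner product per (query token, document token) pair for every scored document, which is the source of the latency the abstract refers to.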
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LEMUR, a framework that reduces multi-vector similarity search (e.g., MaxSim over token embeddings as in ColBERT) to a supervised learning problem solved by a one-hidden-layer neural network, followed by a second reduction that maps inference to single-vector similarity search in the network's latent space. This enables reuse of existing single-vector indexes and yields an order-of-magnitude latency improvement over prior multi-vector methods while aiming to preserve retrieval quality.
Significance. If the empirical results confirm that the learned reduction preserves document rankings at a level comparable to direct MaxSim, the work would provide a practical bridge between high-quality late-interaction models and efficient single-vector infrastructure. The availability of code at the cited GitHub repository is a positive contribution to reproducibility.
major comments (2)
- [§3] §3 (Method, second reduction): The claim that single-vector search in the latent space produces rankings equivalent to the original MaxSim computation is load-bearing for the speedup result. The manuscript should report a quantitative measure of ranking agreement (e.g., Kendall-tau or fraction of queries with identical top-k ordering) between LEMUR and direct MaxSim on the same queries and corpus.
- [§4] §4 (Experiments): The reported latency gains are measured against prior multi-vector baselines; however, without side-by-side effectiveness numbers (MRR@10, Recall@1000, or nDCG) on standard benchmarks such as MS MARCO or BEIR, it is impossible to verify that the operating point remains comparable to the original multi-vector system.
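The ranking-agreement measures requested in the first major comment could be computed along the following lines. This is a sketch that assumes both systems rank the same set of documents with no ties; the function names and toy runs are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # x precedes y in rank_a; check whether rank_b agrees.
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def identical_topk_fraction(runs_a, runs_b, k):
    """Fraction of queries whose top-k lists match exactly, order included."""
    same = sum(1 for a, b in zip(runs_a, runs_b) if a[:k] == b[:k])
    return same / len(runs_a)


# Toy per-query rankings (hypothetical ids): exact MaxSim vs. an approximation.
exact_runs = [["d1", "d2", "d3", "d4"], ["d5", "d6", "d7", "d8"]]
approx_runs = [["d1", "d2", "d4", "d3"], ["d5", "d6", "d7", "d8"]]

taus = [kendall_tau(a, b) for a, b in zip(exact_runs, approx_runs)]
same_top2 = identical_topk_fraction(exact_runs, approx_runs, k=2)
same_top3 = identical_topk_fraction(exact_runs, approx_runs, k=3)
```

The pairwise loop is quadratic in list length, which is acceptable for top-k lists of a few hundred items; longer lists would call for a library implementation.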
minor comments (2)
- [Abstract / §3.1] The abstract calls the approach 'simple yet efficient' but does not state the training objective or loss used for the one-hidden-layer network; the objective should be defined in §3.1 with an equation.
- [Figure 1] Figure 1 (or equivalent diagram) would benefit from explicit arrows showing the flow from token embeddings through the hidden layer to the latent single-vector representation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of empirical validation that will strengthen the presentation of LEMUR. We address each major comment below and commit to incorporating the suggested analyses in the revised version.
Point-by-point responses
- Referee: [§3] §3 (Method, second reduction): The claim that single-vector search in the latent space produces rankings equivalent to the original MaxSim computation is load-bearing for the speedup result. The manuscript should report a quantitative measure of ranking agreement (e.g., Kendall-tau or fraction of queries with identical top-k ordering) between LEMUR and direct MaxSim on the same queries and corpus.
  Authors: We agree that a quantitative measure of ranking agreement would strengthen the paper. To clarify, the second reduction is exactly equivalent to inference under the learned one-hidden-layer neural network (i.e., the approximation to MaxSim); we do not claim exact equivalence to the original MaxSim computation itself. Nevertheless, we will add a new analysis in the revised manuscript that reports Kendall's tau and the fraction of queries with identical top-k orderings between LEMUR's latent-space search and direct MaxSim on the evaluation sets. This will allow readers to assess the end-to-end fidelity of the overall approach.
  Revision: yes
- Referee: [§4] §4 (Experiments): The reported latency gains are measured against prior multi-vector baselines; however, without side-by-side effectiveness numbers (MRR@10, Recall@1000, or nDCG) on standard benchmarks such as MS MARCO or BEIR, it is impossible to verify that the operating point remains comparable to the original multi-vector system.
  Authors: We thank the referee for this observation. While the current experiments already compare LEMUR against multi-vector baselines on both latency and retrieval quality, we will revise the experimental section to include explicit side-by-side tables reporting MRR@10, Recall@1000, and nDCG@10 for LEMUR, the original MaxSim implementation, and other baselines on MS MARCO and BEIR. This will make the comparability of the operating point fully transparent.
  Revision: yes
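For reference, two of the effectiveness metrics named in this exchange can be sketched per query as follows. The nDCG variant here uses linear gains with a log2 rank discount, which is one common convention; the paper's evaluation tooling may differ. Function names and the toy ranking are hypothetical.

```python
import math


def mrr_at_k(ranking, relevant, k=10):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranking, gains, k=10):
    """nDCG@k with graded gains given as {doc_id: gain}; unlisted docs count as 0."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2)
              for i, doc in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical ids): the only relevant document sits at rank 2.
ranking = ["d4", "d1", "d9"]
gains = {"d1": 1.0}

rr = mrr_at_k(ranking, set(gains))
ndcg = ndcg_at_k(ranking, gains)
```

Both functions return a per-query value; the reported metric is the mean over the query set, computed identically for LEMUR and for the exact MaxSim baseline.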
Circularity Check
No significant circularity; reductions are modeling choices with external leverage
The paper presents LEMUR as two explicit problem reductions: casting multi-vector MaxSim search as a supervised one-hidden-layer network learning task, followed by mapping inference to single-vector search in the resulting latent space to reuse existing indexes. These steps are algorithmic reformulations and modeling decisions trained on external data, not self-referential definitions or fitted parameters renamed as predictions. The claimed order-of-magnitude speedup derives from leveraging independent single-vector indexes rather than any loop back to the paper's own inputs or self-citations. No load-bearing uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results appear in the derivation chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- one-hidden-layer neural network parameters
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We first formulate multi-vector similarity search as a supervised learning problem that can be solved using a one-hidden-layer neural network. Second, we reduce inference under this model to single-vector similarity search in its latent space"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.