arxiv: 2604.28142 · v1 · submitted 2026-04-30 · 💻 cs.IR · cs.LG

Recognition: unknown

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Silvio Martinico , Franco Maria Nardini , Cosimo Rulli , Rossano Venturini

Authors on Pith no claims yet

Pith reviewed 2026-05-07 05:33 UTC · model grok-4.3

classification 💻 cs.IR cs.LG

keywords multivector retrievaltoken-aware clusteringhierarchical indexingproduct quantizationclustering efficiencyinformation retrievalvector searchretrieval speedup

0 comments

The pith

Token-aware clustering scales multivector retrieval to millions of centroids with 247× faster indexing and 9.8× quicker searches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multivector retrieval models represent documents at the token level for high effectiveness, but grouping their vectors with standard k-means is slow at large cluster counts and biases toward common tokens while neglecting rare ones. TACHIOM changes centroid allocation to account for each token's distribution in the data, which speeds clustering dramatically and yields clusters that support accurate scoring from centroids alone. The system then layers a graph-based index over those centroids and uses optimized product quantization for final scoring. Experiments on MS-MARCOv1 and LoTTE confirm the speed gains while effectiveness stays comparable or better. A reader cares because this removes a major barrier to deploying fine-grained retrieval at web scale.

Core claim

TACHIOM exploits token-level structure by accounting for tokens' distribution during centroid allocation. This enables scaling to millions of centroids, highly accurate document scoring using only centroids, and combines a graph-based index over centroids with an optimized Product Quantization layout. As a result, it achieves up to 247× faster clustering than k-means and up to 9.8× retrieval speedup over state-of-the-art systems on MS-MARCOv1 and LoTTE while maintaining comparable or superior effectiveness.

What carries the argument

Token-aware centroid allocation that incorporates token distribution to balance frequent and rare tokens, paired with a graph-based index over centroids and product quantization for final scoring.

If this is right

Multivector models can be indexed at larger scales because clustering no longer becomes the bottleneck.
Rare discriminative tokens receive better representation, which can improve accuracy on tail queries.
Document scoring can rely on centroids instead of full token-level vectors, cutting per-query computation.
The hierarchical graph index allows quick selection of candidate centroids before quantization-based refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The token-distribution principle could transfer to clustering other kinds of high-dimensional embeddings used in recommendation or classification.
Further memory savings might come from combining the same awareness with learned quantization or pruning techniques.
Long-term robustness would be tested by running the system on streaming collections where token distributions shift over time.
Similar awareness of item frequency could be added to other stages of the retrieval pipeline, such as query encoding or re-ranking.

Load-bearing premise

That incorporating token distribution into centroid allocation will continue to produce high-quality clusters at millions of centroids without hidden effectiveness losses on unseen data distributions or query types.

What would settle it

Retrieval effectiveness measured by NDCG or recall falling below current state-of-the-art baselines on a new large corpus with markedly different token frequencies, even when clustering remains fast, would show the quality claim does not hold at scale.

Figures

Figures reproduced from arXiv: 2604.28142 by Cosimo Rulli, Franco Maria Nardini, Rossano Venturini, Silvio Martinico.

**Figure 1.** Figure 1: Distance tables layout with 4 centroids per subspace. view at source ↗

**Figure 2.** Figure 2: Ms Marco-v1. Top plot: MRR@10 for 1,000 candidates reranking. Bottom plot: clustering time in minutes view at source ↗

read the original abstract

Multivector retrieval models achieve state-of-the-art effectiveness through fine-grained token-level representations, but their deployment incurs substantial computational and memory costs. Current solutions, based on the well-known k-means clustering algorithm, group similar vectors together to enable both effective compression and efficient retrieval. However, standard k-means scales poorly with the number of clusters and dataset size, and favours frequent tokens during training while underrepresenting rare, discriminative ones. In this work, we introduce TACHIOM, a multivector retrieval system that exploits token-level structure to significantly accelerate both clustering and retrieval. By accounting for tokens' distribution during centroid allocation, TACHIOM easily scales to millions of centroids, enabling highly accurate document scoring using only centroids, avoiding expensive token-level computation. TACHIOM combines a graph-based index over centroids with an optimized Product Quantization layout for efficient final scoring. Experiments on MS-MARCOv1 and LoTTE show that TACHIOM achieves up to $247\times$ faster clustering than k-means and up to $9.8\times$ retrieval speedup over state-of-the-art systems while maintaining comparable or superior effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TACHIOM's token-aware centroid allocation delivers clear clustering and retrieval speedups on the tested sets, but the cluster quality at million-centroid scale rests on limited evidence.

read the letter

The main thing to know is that this paper gives a practical fix for scaling multivector retrieval: instead of plain k-means, TACHIOM factors in token distribution when placing centroids. That change, plus a graph index over the centroids and tuned product quantization, is what lets them push to millions of centroids while scoring documents from the centroids alone. On MS-MARCOv1 and LoTTE they report 247× faster clustering than k-means and up to 9.8× faster retrieval than prior systems, with effectiveness that is at least comparable. If those numbers are stable, it is a useful engineering step for anyone who wants token-level models in production without the usual cost blow-up. The approach is straightforward and directly targets the two problems the abstract names—slow clustering at high centroid counts and the bias toward frequent tokens—so the core claim is easy to understand and test. The hierarchical layout with graph search plus PQ also looks like a reasonable way to keep final scoring cheap. The soft spot is exactly the one the stress-test flags: the paper only shows results on two standard collections and does not appear to include ablations on rare-token recall, cluster purity, or out-of-distribution queries as the number of centroids grows. Without those, it is still possible that the token-aware rule quietly loses coverage on discriminative but infrequent terms at the largest scales. The evaluation is empirical rather than circular, which is fine, but it leaves the generalization question open. This is the sort of systems paper that would interest people running large-scale retrieval who already use multivector representations. A reader who needs concrete indexing tricks for late-interaction models will get value from the speed numbers and the implementation choices. It is worth sending to peer review; the idea is concrete, the problem matters, and a referee can ask for the missing scale and rarity checks without much trouble.

Referee Report

2 major / 3 minor

Summary. The paper introduces TACHIOM, a multivector retrieval system that replaces standard k-means with a token-aware centroid allocation heuristic during clustering. By incorporating token distribution statistics into centroid assignment, the method scales to millions of centroids, enabling approximate document scoring using only centroids. TACHIOM further combines a graph-based index over these centroids with an optimized Product Quantization layout. On MS-MARCOv1 and LoTTE, it reports up to 247× faster clustering than k-means and up to 9.8× retrieval latency reduction versus prior state-of-the-art multivector systems while preserving or improving effectiveness metrics such as MRR@10 and Recall@1000.

Significance. If the reported speed/effectiveness trade-off holds under controlled conditions, the work would be a meaningful engineering contribution to the IR community. Multivector models remain the effectiveness frontier for passage retrieval, yet their deployment cost has limited adoption; a practical route to million-centroid indexes that still respects rare-token discriminativeness would lower that barrier. The explicit use of token-frequency information during clustering is a simple but under-explored idea that could generalize beyond the two evaluated collections.

major comments (2)

[§4.2] §4.2 (Scaling Experiments) and Table 4: the central claim that token-aware allocation “avoids under-representation of rare, discriminative tokens” at million-centroid scale rests on aggregate effectiveness numbers only. No per-token recall, cluster-purity, or rare-token coverage curves are shown as centroid count grows from 10k to 1M; without these, it is impossible to verify that the heuristic continues to protect the tokens that matter most for ranking.
[§4.3] §4.3 (Generalization): effectiveness is reported only on MS-MARCOv1 and LoTTE. No held-out query distribution, cross-domain, or OOD test set is evaluated. Because the token-aware heuristic explicitly depends on the training collection’s token statistics, the absence of any distribution-shift experiment leaves the robustness of the reported 9.8× speedup claim untested.

minor comments (3)

[§3.1] The notation for the token-frequency weighting term in the centroid-update rule (Eq. 3) is introduced without a clear definition of the smoothing constant; a one-sentence clarification would remove ambiguity.
[Figure 3] Figure 3 (clustering wall-clock time vs. number of centroids) uses a log-log scale but omits error bars or run-to-run variance; adding these would strengthen the 247× claim.
[§2] Related-work discussion of prior token-aware or frequency-balanced clustering methods (e.g., in NLP) is brief; two additional citations would better situate the novelty of the heuristic.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that will provide additional supporting analysis without altering the core claims or experimental setup of the manuscript.

read point-by-point responses

Referee: [§4.2] §4.2 (Scaling Experiments) and Table 4: the central claim that token-aware allocation “avoids under-representation of rare, discriminative tokens” at million-centroid scale rests on aggregate effectiveness numbers only. No per-token recall, cluster-purity, or rare-token coverage curves are shown as centroid count grows from 10k to 1M; without these, it is impossible to verify that the heuristic continues to protect the tokens that matter most for ranking.

Authors: We appreciate this observation. The primary evidence for the token-aware heuristic in §4.2 is the preservation (and in some cases improvement) of end-to-end ranking metrics such as MRR@10 and Recall@1000 at 1M centroids, which are known to be sensitive to the contribution of rare, discriminative tokens. Nevertheless, we agree that aggregate metrics alone leave room for more granular validation. In the revised manuscript we will add (i) rare-token coverage statistics (fraction of tokens appearing in fewer than 100 documents that receive at least one dedicated centroid) and (ii) cluster-purity curves as a function of centroid count for both TACHIOM and standard k-means. These plots will be derived from the same clustering runs already reported in Table 4 and will directly illustrate the heuristic’s effect on rare-token representation. revision: yes
Referee: [§4.3] §4.3 (Generalization): effectiveness is reported only on MS-MARCOv1 and LoTTE. No held-out query distribution, cross-domain, or OOD test set is evaluated. Because the token-aware heuristic explicitly depends on the training collection’s token statistics, the absence of any distribution-shift experiment leaves the robustness of the reported 9.8× speedup claim untested.

Authors: We acknowledge that the token-aware centroid allocation uses collection-specific token-frequency statistics and that stronger evidence of robustness under distribution shift would be beneficial. MS-MARCOv1 and LoTTE already represent distinct domains (web passages versus question-answering passages from multiple sources), and the reported speedups and effectiveness numbers remain consistent across both. To further address the concern, the revised manuscript will include an additional experiment that (a) derives token statistics from a 90 % subset of MS-MARCOv1 training data and (b) evaluates retrieval on the remaining 10 % held-out queries as well as on a small cross-domain sample from a third collection. This will provide a direct test of the 9.8× latency claim under modest distribution shift while keeping the experimental budget modest. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks without self-referential derivations or fitted predictions

full rationale

The paper introduces the TACHIOM system via algorithmic descriptions of token-aware centroid allocation, graph-based indexing, and product quantization, then validates performance through direct runtime and effectiveness measurements against external baselines (k-means, prior multivector systems) on MS-MARCOv1 and LoTTE. No equations, parameter fittings, or derivations appear in the provided text; claims of 247× clustering speedup and 9.8× retrieval speedup are presented as observed outcomes rather than constructed from the method's own inputs. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The derivation chain is therefore self-contained as an engineering contribution evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no explicit free parameters, axioms, or invented entities are described. The approach modifies standard k-means and product quantization (both from prior literature) with a token-distribution heuristic whose exact formulation is not given.

pith-pipeline@v0.9.0 · 5504 in / 1195 out tokens · 44295 ms · 2026-05-07T05:33:15.194244+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 17 canonical work pages · 1 internal anchor

[1]

Zheng Bian, Man Lung Yiu, and Bo Tang. 2025. IGP: Efficient Multi-Vector Retrieval via Proximity Graph Index. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2524–2533. doi:10.1145/3726302.3730004

work page doi:10.1145/3726302.3730004 2025
[2]

Benjamin Clavié and Benjamin Warner. 2025. fastkmeans: Accelerated KMeans Clustering in PyTorch and Triton. https://github.com/AnswerDotAI/ fastkmeans/

2025
[3]

Leonardo Delfino, Domenico Erriquez, Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. 2025. kANNolo: Sweet and Smooth Approx- imate k-Nearest Neighbors Search. InProceedings of the 47th European Conference on Information Retrieval. 400–406

2025
[4]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186. doi:10. 18653/v1/N19-1423

2019
[5]

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The FAISS library.arXiv preprint arXiv:2401.08281(2024)

work page internal anchor Pith review arXiv 2024
[6]

Joshua Engels, Benjamin Coleman, Vihan Lakshman, and Anshumali Shrivastava
[7]

InProceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS)

DESSERT: an efficient algorithm for vector set search with vector set queries. InProceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). Article 2974, 21 pages. Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini
[8]

Yan Fang, Jingtao Zhan, Yiqun Liu, Jiaxin Mao, Min Zhang, and Shaoping Ma
[9]

InProceedings of the 11th CCF International Conference Natural Language Processing and Chinese Computing

Joint Optimization of Multi-vector Representation with Product Quantiza- tion. InProceedings of the 11th CCF International Conference Natural Language Processing and Chinese Computing. 631–642. doi:10.1007/978-3-031-17120-8_49

work page doi:10.1007/978-3-031-17120-8_49
[11]

arXiv:2109.10086 [cs] doi:10.48550/arXiv.2109.10086

SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086(2021)

work page arXiv 2021
[12]

Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant
[13]

InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2353–2359. doi:10.1145/3477495.3531857

work page doi:10.1145/3477495.3531857
[14]

Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized Product Quantization.IEEE Trans. Pattern Anal. Mach. Intell.36, 4 (2014), 744–755. doi:10. 1109/TPAMI.2013.240

2014
[15]

2024.Intel®oneAPI Math Kernel Library (oneMKL)

Intel Corporation. 2024.Intel®oneAPI Math Kernel Library (oneMKL). https: //software.intel.com/en-us/mkl

2024
[16]

Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search.IEEE transactions on pattern analysis and machine intelligence33 (2011), 117–28. doi:10.1109/TPAMI.2010.57

work page doi:10.1109/tpami.2010.57 2011
[17]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open- Domain Question Answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781. doi:10.18653/v1/ 2020.emnlp-main.550

work page doi:10.18653/v1/ 2020
[18]

Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48. doi:10.1145/3397271.3401075

work page doi:10.1145/3397271.3401075 2020
[19]

Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE.arXiv preprint arXiv:2403.06789(2024)

work page arXiv 2024
[20]

Jinhyuk Lee, Zhuyun Dai, Sai Meher Karthik Duddu, Tao Lei, Iftekhar Naim, Ming- Wei Chang, and Vincent Y. Zhao. 2023. Rethinking the role of token retrieval in multi-vector retrieval. InProceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS). Article 677, 22 pages

2023
[21]

Sean MacAvaney and Nicola Tonellotto. 2024. A Reproducibility Study of PLAID. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1411–1419. doi:10.1145/3626772.3657856

work page doi:10.1145/3626772.3657856 2024
[22]

Malkov and Dmitry A

Yu A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE Trans. Pattern Anal. Mach. Intell.42, 4 (2020), 824–836. doi:10.1109/TPAMI.2018.2889473

work page doi:10.1109/tpami.2018.2889473 2020
[23]

Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini
[24]

arXiv:2601.05200 [cs.IR] https://arxiv.org/abs/ 2601.05200

Multivector Reranking in the Era of Strong First-Stage Retrievers.arXiv preprint arXiv:2601.05200(2026). arXiv:2601.05200 [cs.IR] https://arxiv.org/abs/ 2601.05200

work page arXiv 2026
[25]

Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. 2024. Efficient Multi-vector Dense Retrieval with Bit Vectors. InProceedings of the 46th European Conference on Information Retrieval (ECIR). 3–17. doi:10.1007/978-3-031-56060- 6_1

work page doi:10.1007/978-3-031-56060- 2024
[26]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. InProceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches, Vol. 1773. https://ceur- ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf

2016
[27]

Robertson

Stephen E. Robertson. 2004. Understanding inverse document frequency: on theoretical arguments for IDF.J. Documentation60 (2004), 503–520. https: //api.semanticscholar.org/CorpusID:8864928

2004
[28]

Salton, A

G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing.Commun. ACM18, 11 (Nov. 1975), 613–620. doi:10.1145/361219.361220

work page doi:10.1145/361219.361220 1975
[29]

Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. PLAID: An Efficient Engine for Late Interaction Retrieval. InProceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM). 1747–1756. doi:10.1145/3511808.3557325

work page doi:10.1145/3511808.3557325 2022
[30]

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3715–3734. doi:10.18653/v1/2022.naacl-main.272

work page doi:10.18653/v1/2022.naacl-main.272 2022
[31]

Jan Luca Scheerer, Matei Zaharia, Christopher Potts, Gustavo Alonso, and Omar Khattab. 2025. WARP: An Efficient Engine for Multi-Vector Retrieval. InProceed- ings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2504–2512. doi:10.1145/3726302.3729904

work page doi:10.1145/3726302.3729904 2025
[32]

1988.A statistical interpretation of term specificity and its application in retrieval

Karen Sparck Jones. 1988.A statistical interpretation of term specificity and its application in retrieval. Taylor Graham Publishing, GBR, 132–142

1988
[33]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. InProceedings of the 31st International Conference on Neural Informa- tion Processing Systems (NeurIPS). 6000–6010

2017