A Voronoi Cell Formulation for Principled Token Pruning in Late-Interaction Retrieval Models

Benjamin Piwowarski; Joseph Le Roux; Nadi Tomeh; Yash Kankanampati; Yuxuan Zong

arxiv: 2603.09933 · v3 · submitted 2026-03-10 · 💻 cs.IR

Yash Kankanampati, Yuxuan Zong, Nadi Tomeh, Benjamin Piwowarski, Joseph Le Roux

Pith reviewed 2026-05-15 13:05 UTC · model grok-4.3

classification 💻 cs.IR

keywords token pruning · late-interaction retrieval · Voronoi cells · dense embeddings · index compression · embedding geometry · ColBERT

The pith

Token pruning in late-interaction models can be framed as estimating Voronoi cell sizes in embedding space to cut index size while keeping retrieval quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Late-interaction retrievers store an embedding for every document token, which inflates storage. The paper replaces ad-hoc importance scores with a geometric rule: each token's pruning priority equals the size of the Voronoi cell it owns in the learned embedding space. Larger cells indicate greater influence on similarity computations, so they are retained; smaller cells are dropped. Experiments show the resulting indexes are smaller yet match the effectiveness of the full model and outperform prior statistical pruning baselines. The same cell sizes also expose regularities in how individual tokens contribute to ranking decisions.
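
The rule above can be illustrated with a minimal sketch. It is not the paper's estimator, which this page does not describe. It assumes unit-norm embeddings, dot-product similarity, and a uniform measure over directions, and it estimates each token's cell share by Monte Carlo sampling. The function names `estimate_cell_shares` and `prune_smallest_cells` are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere (normalised Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_cell_shares(doc_tokens, n_samples=5000, seed=0):
    """Fraction of sampled directions for which each token is the best match.

    Assumption: a uniform measure over directions stands in for "cell size".
    """
    rng = random.Random(seed)
    dim = len(doc_tokens[0])
    wins = [0] * len(doc_tokens)
    for _ in range(n_samples):
        q = random_unit_vector(dim, rng)
        best = max(range(len(doc_tokens)), key=lambda j: dot(q, doc_tokens[j]))
        wins[best] += 1
    return [w / n_samples for w in wins]


def prune_smallest_cells(doc_tokens, keep, **kwargs):
    """Return the indices of the `keep` tokens with the largest estimated cells."""
    shares = estimate_cell_shares(doc_tokens, **kwargs)
    ranked = sorted(range(len(doc_tokens)), key=lambda j: shares[j], reverse=True)
    return sorted(ranked[:keep])


# Toy 2D document: tokens at 0, 10, 90 and 180 degrees on the unit circle.
toy_doc = [[math.cos(math.radians(a)), math.sin(math.radians(a))]
           for a in (0, 10, 90, 180)]
print(prune_smallest_cells(toy_doc, keep=3))
```

In this toy layout the token at 10 degrees owns the narrowest arc (45 of 360 degrees), so it is the one dropped when three tokens are kept.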

Core claim

We cast token pruning as a Voronoi cell estimation problem in the embedding space. By interpreting each token's influence as a measure of its Voronoi region, our approach enables principled pruning that retains retrieval quality while reducing index size.

What carries the argument

Voronoi cell size in the embedding space, used as a direct geometric proxy for each token's contribution to late-interaction similarity scores.
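
For orientation, the objects involved can be written out as follows. The notation is an assumption made here for illustration and is not taken from the paper: $q_i$ are query token embeddings, $d_j$ are document token embeddings, $\mathcal{X}$ is the embedding space, and $\mu$ is whichever measure is used to size a cell.

```latex
% Notation assumed for illustration; the paper's own symbols may differ.
\begin{align}
  S(q, d) &= \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} \operatorname{sim}(q_i, d_j), \\
  V_j &= \bigl\{ x \in \mathcal{X} : \operatorname{sim}(x, d_j) \ge \operatorname{sim}(x, d_k)
        \ \text{for all } k \ne j \bigr\}, \\
  \operatorname{influence}(d_j) &= \mu(V_j).
\end{align}
```

The first line is the standard late-interaction (MaxSim) score. Ties aside, removing $d_j$ changes $S(q, d)$ only through query tokens $q_i$ that fall inside $V_j$, which is why the size of $V_j$ is a natural candidate for token influence.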

If this is right

Index storage can be reduced by discarding embeddings whose Voronoi regions are small, without separate importance classifiers.
The same cell-size scores give an interpretable ranking of which tokens drive retrieval decisions inside any late-interaction model.
Pruning decisions become deterministic once the embedding space is fixed, removing the need for task-specific tuning of pruning thresholds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The geometric framing could be applied to prune other per-token representations, such as those in late-interaction question-answering or dense passage rerankers.
If Voronoi volumes correlate with token frequency or semantic specificity, the method might also inform vocabulary construction or embedding regularization during training.
Query-dependent pruning extensions could recompute only the cells relevant to a given query embedding, further shrinking runtime memory.

Load-bearing premise

The volume of a token's Voronoi cell in the trained embedding space accurately tracks how much that token affects final retrieval rankings.

What would settle it

A controlled test in which documents are pruned by Voronoi cell size yet retrieval metrics drop below those achieved by the best statistical pruning baseline on the same collection and queries.

Figures

Figures reproduced from arXiv: 2603.09933 by Benjamin Piwowarski, Joseph Le Roux, Nadi Tomeh, Yash Kankanampati, Yuxuan Zong.

Figure 2: An example illustrating the difference in iterative and non-iterative Voronoi pruning of 2D document vectors. Each …

Figure 3: Performance of LP Pruning and Voronoi Pruning …

Figure 5: Distribution showing when a token at a particular …

Figure 6: Relationship between Mean Error and retrieval per…
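
The Figure 2 caption contrasts iterative and non-iterative Voronoi pruning in 2D. The caption is truncated here, so the following is only one plausible reading, sketched under stated assumptions: "non-iterative" scores every token once and drops the smallest cells in a single pass, while "iterative" drops one token at a time and recomputes the cells of the survivors. Tokens are unit vectors on the circle, given by their angles, so each cell is an exact arc. The function names are hypothetical.

```python
import math


def cell_arcs(angles):
    """Exact Voronoi arc (in radians) owned by each unit vector on the circle."""
    n = len(angles)
    if n == 1:
        return [2 * math.pi]
    order = sorted(range(n), key=lambda i: angles[i])
    arcs = [0.0] * n
    for pos, i in enumerate(order):
        gap_prev = (angles[i] - angles[order[pos - 1]]) % (2 * math.pi)
        gap_next = (angles[order[(pos + 1) % n]] - angles[i]) % (2 * math.pi)
        # Each cell reaches halfway to the neighbour on either side.
        arcs[i] = (gap_prev + gap_next) / 2
    return arcs


def prune_once(angles, n_drop):
    """Non-iterative reading: score once, drop the n_drop smallest cells."""
    arcs = cell_arcs(angles)
    dropped = set(sorted(range(len(angles)), key=lambda i: arcs[i])[:n_drop])
    return [i for i in range(len(angles)) if i not in dropped]


def prune_iteratively(angles, n_drop):
    """Iterative reading: drop the smallest cell, recompute, repeat."""
    alive = list(range(len(angles)))
    for _ in range(n_drop):
        arcs = cell_arcs([angles[i] for i in alive])
        smallest = min(range(len(alive)), key=lambda k: arcs[k])
        del alive[smallest]
    return alive


angles = [math.radians(d) for d in (0, 10, 11, 20, 40, 200, 208, 216)]
print(prune_once(angles, 2))         # [0, 3, 4, 5, 6, 7]
print(prune_iteratively(angles, 2))  # [0, 1, 3, 4, 5, 7]
```

The two procedures disagree on this input: once the token at 11 degrees is removed, its neighbour at 10 degrees inherits part of the freed arc and is no longer among the smallest cells, so the iterative variant drops the token at 208 degrees instead.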

Original abstract

Late-interaction models such as ColBERT offer competitive performance across various retrieval tasks but require storing a dense embedding for each document token, leading to a substantial index storage overhead. Past works address this by attempting to prune low-importance token embeddings based on statistical and empirical measures, but they often either lack formal grounding or are ineffective. To address these shortcomings, we introduce a framework grounded in hyperspace geometry and cast token pruning as a Voronoi cell estimation problem in the embedding space. By interpreting each token's influence as a measure of its Voronoi region, our approach enables principled pruning that retains retrieval quality while reducing index size. Through our experiments, we demonstrate that this approach serves not only as a competitive pruning strategy but also as a valuable tool for improving and interpreting token-level behavior within dense retrieval systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The Voronoi-cell framing for token pruning is a clean geometric shift from statistical methods, but the assumption that cell volume tracks actual max-similarity utility in late-interaction scoring looks like the load-bearing weak point.

The paper's core move is to recast token pruning in ColBERT-style late-interaction models as a Voronoi cell estimation task in embedding space. Each token's importance is tied to the size of its region rather than frequency counts or learned weights. This is the main new element: a direct geometric construction instead of the empirical or statistical rules in earlier pruning work. If the mapping holds, it gives a more interpretable way to shrink indexes while keeping retrieval quality, and the abstract indicates the experiments position it as competitive on that front. The side benefit they flag, using the same framing to inspect token-level behavior, is a reasonable extra payoff for people who already work with these models.

The soft spot is exactly the one the stress-test flags. Late-interaction scoring takes the max similarity per query token, so a token's real utility depends on whether queries actually land in its region as the winner. A small Voronoi cell could still be the unique high-value match for important queries, while large cells might sit in low-impact areas. Pruning by cell volume therefore risks cutting the wrong tokens unless query embeddings closely follow the same measure as the document embedding space. The paper would need to show that this alignment holds in practice, perhaps through ablation on query distributions or direct correlation between cell size and retrieval contribution. Without that link demonstrated, the geometric claim stays more aspirational than proven.

This is for groups already running late-interaction systems and looking for index compression options. A reader focused on geometric or formal methods in IR would get the most from the framing. It is worth sending to peer review because the problem is real, the angle is distinct, and the experiments are claimed to exist; a referee can check whether the results actually close the gap between cell volume and max-similarity impact.
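
The concern about query distributions can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than something taken from the paper: tokens are unit vectors on a circle, similarity is cosine (so the winner is the token at the smallest angular distance), and the "focused" query distribution is a narrow Gaussian centred on one token.

```python
import math
import random

TWO_PI = 2 * math.pi


def angular_distance(a, b):
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def win_rates(token_angles, query_angles):
    """Share of queries for which each token is the max-similarity match."""
    wins = [0] * len(token_angles)
    for q in query_angles:
        winner = min(range(len(token_angles)),
                     key=lambda j: angular_distance(q, token_angles[j]))
        wins[winner] += 1
    return [w / len(query_angles) for w in wins]


rng = random.Random(0)
# The token at 90 degrees is squeezed between neighbours at 88 and 92 degrees.
tokens = [math.radians(d) for d in (0, 88, 90, 92, 180, 270)]
uniform_queries = [rng.uniform(0, TWO_PI) for _ in range(4000)]
focused_queries = [rng.gauss(math.radians(90), math.radians(0.5))
                   for _ in range(4000)]

uniform_share = win_rates(tokens, uniform_queries)[2]
focused_share = win_rates(tokens, focused_queries)[2]
print(f"uniform queries: {uniform_share:.3f}, focused queries: {focused_share:.3f}")
```

The squeezed token owns a 2-degree arc, so under uniform queries it should win well under 1% of the time, while under the focused distribution it should win the large majority of queries. A uniform-measure cell size and a query-weighted win rate can therefore rank the same token very differently.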

Referee Report

2 major / 2 minor

Summary. The paper proposes casting token pruning in late-interaction models (e.g., ColBERT) as a Voronoi cell estimation problem in the learned embedding space. Token influence is defined as the volume of each token's Voronoi region, providing a geometric basis for pruning low-influence embeddings to reduce index size while preserving retrieval quality. Experiments position the method as competitive with prior statistical pruning techniques and as an interpretive tool for token-level behavior.

Significance. If the geometric measure aligns with actual retrieval contributions, the approach supplies a parameter-free, formally grounded alternative to heuristic pruning, with potential benefits for index compression and model interpretability in dense retrieval. The absence of free parameters and the direct construction from embedding geometry are notable strengths.

major comments (2)

§3 (Voronoi formulation): the central claim that Voronoi cell volume equals token influence for late-interaction scoring is not justified. Late-interaction scores use max_{d_token} sim(q, d_token) per query token; cell volume is a measure under the uniform embedding measure, but query embeddings may concentrate in small cells that are nevertheless the unique maximizer for important queries. This mismatch is load-bearing for the pruning guarantee.
§4.3 (experimental validation): the reported nDCG@10 and recall curves at varying pruning ratios do not include controls that vary query embedding distribution independently of the document embedding measure. Without such controls, it is unclear whether observed retention of quality stems from the Voronoi sizes or from incidental correlation with frequency-based importance.
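
Since the comments above turn on nDCG@10 curves, a minimal sketch of the metric may help. It assumes linear gain and a log2 rank discount, which is one common variant; the evaluation tooling the paper actually used is not stated on this page, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: labels of the returned documents, in rank order.
    judged_relevances: labels of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal


# A relevant document returned at rank 2 instead of rank 1.
print(round(ndcg_at_k([0, 1], [1]), 4))  # 0.6309
```

A pruning-ratio curve of the kind the referee describes would average this value over queries for each pruned index.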

minor comments (2)

§3.1: Notation for the embedding space metric and the precise definition of cell volume (e.g., whether Lebesgue measure or a learned density) are introduced without an explicit equation reference in the main text; add a numbered equation.
Figure 2 (Voronoi diagram example): the figure lacks axis labels and scale; the visual does not indicate whether the plotted space is the full embedding dimension or a PCA projection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment below, providing clarifications and indicating where revisions will be made to the manuscript.

Referee: §3 (Voronoi formulation): the central claim that Voronoi cell volume equals token influence for late-interaction scoring is not justified. Late-interaction scores use max_{d_token} sim(q, d_token) per query token; cell volume is a measure under the uniform embedding measure, but query embeddings may concentrate in small cells that are nevertheless the unique maximizer for important queries. This mismatch is load-bearing for the pruning guarantee.

Authors: We agree that the Voronoi cell volume is computed under the uniform measure and does not directly equate to the probability of being the maximizer under arbitrary query distributions. The manuscript presents the Voronoi volume as a geometric interpretation of token influence rather than a strict equality. To address this concern, we will revise Section 3 to explicitly discuss the relationship between cell volume and the max-similarity scoring, including the assumptions under which the volume serves as a proxy for influence. This will include noting potential limitations when query embeddings are highly concentrated.

revision: yes

Referee: §4.3 (experimental validation): the reported nDCG@10 and recall curves at varying pruning ratios do not include controls that vary query embedding distribution independently of the document embedding measure. Without such controls, it is unclear whether observed retention of quality stems from the Voronoi sizes or from incidental correlation with frequency-based importance.

Authors: The experiments in §4.3 compare our method against frequency-based pruning baselines on standard retrieval benchmarks with real query sets. The competitive performance relative to frequency-based methods indicates that the Voronoi measure captures geometric properties beyond mere token frequency in the document collection. However, we acknowledge the value of additional controls. We will add a discussion in the revised manuscript explaining why the current experimental setup provides evidence against pure incidental correlation, and if space permits, include a small-scale synthetic experiment varying query distributions.

revision: partial
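
A small-scale synthetic control of the kind mentioned in this exchange might look like the sketch below. It is a hypothetical harness, not the authors' experiment: tokens and queries are angles on the unit circle, similarity is the cosine of the angle difference, and the quantity measured is the mean drop in per-query-token max similarity after one token is pruned.

```python
import math
import random


def maxsim(query_angle, token_angles):
    """Best cosine similarity between one query token and a set of doc tokens."""
    return max(math.cos(query_angle - t) for t in token_angles)


def mean_maxsim_error(token_angles, kept_indices, query_angles):
    """Average loss in max similarity when only kept_indices are stored."""
    kept = [token_angles[i] for i in kept_indices]
    total = 0.0
    for q in query_angles:
        total += maxsim(q, token_angles) - maxsim(q, kept)
    return total / len(query_angles)


rng = random.Random(0)
doc = [math.radians(d) for d in (0, 88, 90, 92, 180, 270)]
kept = [0, 1, 3, 4, 5]  # prune the token at 90 degrees (smallest uniform cell)

uniform_queries = [rng.uniform(0, 2 * math.pi) for _ in range(2000)]
focused_queries = [rng.gauss(math.radians(90), math.radians(0.5))
                   for _ in range(2000)]

err_uniform = mean_maxsim_error(doc, kept, uniform_queries)
err_focused = mean_maxsim_error(doc, kept, focused_queries)
print(f"uniform: {err_uniform:.2e}, focused: {err_focused:.2e}")
```

In this toy layout the error is larger under the focused queries than under uniform ones, yet it stays small in absolute terms because the pruned token has close neighbours. Whether either pattern carries over to real embedding spaces is exactly what such a control would need to establish.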

Circularity Check

0 steps flagged

No significant circularity; geometric formulation is a modeling choice, not a reduction to inputs

The paper proposes casting token pruning as Voronoi cell estimation in embedding space and defines each token's influence as the measure of its Voronoi region. This is presented as a direct geometric construction rather than a derivation from fitted parameters, self-citations, or prior results by the same authors. No equations, self-citations, or 'predictions' that reduce by construction to the inputs appear in the abstract or described framework. The approach is self-contained as a new principled method whose effectiveness is evaluated empirically, with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Voronoi regions in embedding space correspond to token influence for retrieval; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Voronoi cells in the embedding space measure each token's influence for retrieval. Invoked when casting token pruning as Voronoi cell estimation.

