The Role of Vocabularies in Learning Sparse Representations for Ranking
Pith reviewed 2026-05-18 15:51 UTC · model grok-4.3
The pith
Larger output vocabularies in SPLADE models, after logit pruning, deliver effectiveness comparable to standard models at BM25-level computational costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The size and pretrained weight of output vocabularies configure the representational specification for queries, documents, and their interactions in the retrieval engine, allowing 100K models with pruning to match or exceed the effectiveness of 32K SPLADE models within BM25 computational budgets, and ESPLADE initialization to outperform random vocab initialization at comparable retrieval cost.
What carries the argument
Output vocabulary in the BERT-based SPLADE model, expanded to 100K terms with either ESPLADE pretraining or random initialization, followed by logit-score pruning to a fixed maximum size for balancing efficiency.
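The pruning step can be illustrated with a minimal sketch, assuming it amounts to keeping the highest-scoring terms of each sparse vector up to a fixed cap. The function name `prune_top_k`, the dict-based vector layout, and the example term ids are hypothetical and are not taken from the paper.

```python
import heapq

def prune_top_k(weights, max_size):
    """Keep the max_size highest-weighted terms of a sparse vector.

    weights: dict mapping term id -> non-negative logit-derived weight
    (hypothetical layout chosen for this sketch).
    """
    if max_size <= 0:
        return {}
    # Zero weights never enter an inverted index, so drop them first.
    active = [(w, t) for t, w in weights.items() if w > 0]
    # nlargest orders by weight, breaking ties by the larger term id.
    kept = heapq.nlargest(max_size, active)
    return {t: w for w, t in kept}

# Toy document vector over a 100K-sized vocabulary (ids are made up).
doc = {101: 0.2, 2057: 1.7, 88412: 0.9, 40001: 0.0, 7: 1.1}
pruned = prune_top_k(doc, max_size=2)
```

The cap bounds the number of non-zero terms regardless of vocabulary size, but which terms survive depends on the candidate pool the model scores.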
If this is right
- The pruned 100K models achieve effectiveness comparable to or better than the 32K SPLADE under the same computational budget as BM25.
- ESPLADE-pretrained models outperform randomly initialized vocab models while maintaining similar retrieval costs.
- Vocabulary size and pretraining provide a way to configure representational specifications for more efficient and effective LSR.
- These configurations extend beyond standard NLP meanings to impact query-document matching in retrieval engines.
Where Pith is reading between the lines
- If larger vocabularies work better after pruning, future work could test even bigger sizes or alternative pruning methods to further optimize the trade-off.
- The role of vocabulary might connect to how semantic granularity affects sparse matching across different retrieval datasets.
- Practitioners could experiment with custom vocabulary initializations tailored to specific search domains for improved performance.
Load-bearing premise
The assumption that pruning model outputs to a fixed maximum size based on scores yields fair efficiency-effectiveness comparisons across different vocabulary sizes and starting points, without biases tied to the specific data or cut-off choices.
What would settle it
Running the same experiments on a different search dataset or with a varied pruning threshold and finding that the 100K models no longer match the 32K SPLADE effectiveness within the BM25 cost budget would falsify the claim.
Original abstract
Learned Sparse Retrieval (LSR) such as SPLADE has growing interest for effective semantic 1st stage matching while enjoying the efficiency of inverted indices. A recent work on learning SPLADE models with expanded vocabularies (ESPLADE) was proposed to represent queries and documents into a sparse space of custom vocabulary which have different levels of vocabularic granularity. Within this effort, however, there have not been many studies on the role of vocabulary in SPLADE models and their relationship to retrieval efficiency and effectiveness. To study this, we construct BERT models with 100K-sized output vocabularies, one initialized with the ESPLADE pretraining method and one initialized randomly. After finetune on real-world search click logs, we applied logit score-based queries and documents pruning to max size for further balancing efficiency. The experimental result in our evaluation set shows that, when pruning is applied, the two models are effective compared to the 32K-sized normal SPLADE model in the computational budget under the BM25. And the ESPLADE models are more effective than the random vocab model, while having a similar retrieval cost. The result indicates that the size and pretrained weight of output vocabularies play the role of configuring the representational specification for queries, documents, and their interactions in the retrieval engine, beyond their original meaning and purposes in NLP. These findings can provide a new room for improvement for LSR by identifying the importance of representational specification from vocabulary configuration for efficient and effective retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the role of output vocabulary size and initialization in learned sparse retrieval (LSR) models such as SPLADE. The authors construct 100K-vocabulary BERT models (one initialized via ESPLADE pretraining and one randomly), fine-tune both on search click logs, and apply logit-score-based pruning of queries and documents to a maximum size. They claim that the resulting pruned 100K models match or exceed a standard 32K SPLADE baseline in effectiveness under a BM25 computational budget, with the pretrained ESPLADE variant further outperforming the random-vocabulary model at comparable retrieval cost. The authors conclude that vocabulary size and pretraining configure representational specifications for queries, documents, and their interactions beyond conventional NLP usage.
Significance. If the empirical comparisons hold after addressing the noted gaps, the work would demonstrate that vocabulary configuration can be used as a controllable lever to improve the efficiency-effectiveness trade-off in LSR without increasing retrieval cost, thereby identifying a new design axis for sparse first-stage rankers that complements existing pruning and regularization techniques.
major comments (2)
- [Experimental setup and results (pruning description)] The central claim (abstract and experimental results) that pruned 100K models outperform the 32K SPLADE baseline under a BM25 budget rests on the assumption that logit-score pruning to a fixed maximum size is neutral with respect to vocabulary size. However, the larger candidate pool in the 100K models could systematically select higher-quality or more diverse terms at the same numeric max-size threshold, altering effective sparsity and posting-list costs; no ablation across pruning thresholds, no reporting of post-pruning average non-zeros per query/document, and no comparison of term-quality distributions are provided to rule out this interaction.
- [Abstract and §4 (results)] The abstract and results section state that the 100K models are 'effective compared to the 32K-sized normal SPLADE model' and that 'ESPLADE models are more effective than the random vocab model,' yet supply no concrete effectiveness metrics (e.g., nDCG@10, Recall@1000), no confidence intervals, and no table of raw scores or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported gains.
minor comments (2)
- [Method] The notation for the pruning procedure ('logit score-based queries and documents pruning to max size') is introduced without a formal definition or pseudocode; adding an equation or algorithm box would clarify how the maximum size is enforced separately for queries versus documents.
- [Experimental setup] The manuscript refers to 'the evaluation set' without specifying the dataset name, split, or query count; explicit citation of the test collection (e.g., MS MARCO or a proprietary log) is needed for reproducibility.
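One way the requested formal definition could look is sketched below. It assumes the standard SPLADE max-pooling formulation from reference [6] and a plain top-k selection; the paper's own definition is not given here, so the symbols $z_{ij}$ (logit of vocabulary term $j$ at input position $i$), $K_q$ and $K_d$ (query and document caps), and the $\operatorname{TopK}$ operator are notation introduced for illustration.

```latex
\begin{aligned}
w_j(x) &= \max_{i \in x} \log\bigl(1 + \mathrm{ReLU}(z_{ij})\bigr),
  \qquad j = 1, \dots, |V| \\
\tilde{w}_j(x) &= w_j(x)\,\mathbb{1}\bigl[\, j \in \operatorname{TopK}\bigl(w(x), K_x\bigr) \bigr],
  \qquad K_x \in \{K_q, K_d\} \\
s(q, d) &= \sum_{j=1}^{|V|} \tilde{w}_j(q)\,\tilde{w}_j(d)
\end{aligned}
```

Under this reading, $\lVert \tilde{w}(x) \rVert_0 \le K_x$ holds for any vocabulary size $|V|$, so the same cap applies to the 32K and 100K models, while the set of surviving terms still depends on $|V|$.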
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us identify areas for improvement in our manuscript on the role of vocabularies in SPLADE models. We address each major comment in detail below, indicating the revisions we plan to make.
-
Referee: [Experimental setup and results (pruning description)] The central claim (abstract and experimental results) that pruned 100K models outperform the 32K SPLADE baseline under a BM25 budget rests on the assumption that logit-score pruning to a fixed maximum size is neutral with respect to vocabulary size. However, the larger candidate pool in the 100K models could systematically select higher-quality or more diverse terms at the same numeric max-size threshold, altering effective sparsity and posting-list costs; no ablation across pruning thresholds, no reporting of post-pruning average non-zeros per query/document, and no comparison of term-quality distributions are provided to rule out this interaction.
Authors: We appreciate this insightful observation regarding the potential non-neutrality of the pruning process across different vocabulary sizes. To strengthen the manuscript, we will revise Section 4 to include the average number of non-zero terms for queries and documents after pruning for all compared models. We will also add an ablation study examining performance across a range of pruning thresholds (maximum sizes) to demonstrate that the effectiveness gains of the 100K models hold consistently. While a detailed comparison of term-quality distributions would require additional analysis beyond the current scope, the controlled computational budget under BM25 and the superior effectiveness metrics support that the larger vocabulary enables better representational choices. We believe these additions will adequately address the concern.
revision: partial
-
Referee: [Abstract and §4 (results)] The abstract and results section state that the 100K models are 'effective compared to the 32K-sized normal SPLADE model' and that 'ESPLADE models are more effective than the random vocab model,' yet supply no concrete effectiveness metrics (e.g., nDCG@10, Recall@1000), no confidence intervals, and no table of raw scores or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported gains.
Authors: We acknowledge the need for more quantitative details in the abstract and results presentation. In the revised manuscript, we will expand the abstract to reference specific improvements and include a comprehensive table in Section 4 with raw effectiveness scores (nDCG@10, Recall@1000, etc.) for the baseline and proposed models, both pre- and post-pruning. Confidence intervals will be reported, and we will conduct and report statistical significance tests to validate the differences. This will allow readers to better evaluate the magnitude and reliability of the gains.
revision: yes
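The kind of reporting promised here can be sketched as follows: per-query MRR@10 and Recall@k, plus a paired sign-flip randomization test on per-query score differences. This is a minimal illustration, not the authors' evaluation code; the function names, the choice of test, and the toy inputs are all assumptions.

```python
import random

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided sign-flip test on paired per-query scores; returns a p-value."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        # Randomly swap the two systems' labels for each query.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (trials + 1)

# Toy usage with made-up document ids.
rr = mrr_at_k(["d3", "d1", "d2"], {"d1"})
rec = recall_at_k(["d3", "d1", "d2"], {"d1", "d9"}, k=3)
```

Feeding the per-query MRR@10 values of two models into `paired_randomization_test` gives a p-value for the mean difference without assuming normality.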
Circularity Check
No significant circularity; results are empirical and self-contained
The paper reports an empirical study: BERT models are built with 100K output vocabularies (one ESPLADE-pretrained, one random), fine-tuned on click logs, then pruned by logit scores to a fixed maximum size before comparing effectiveness and retrieval cost against a 32K SPLADE baseline under BM25 budget. All claims rest on these direct experimental measurements of effectiveness and efficiency; the provided text contains no equations, derivations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled through prior work. The derivation chain is therefore absent, and the results do not reduce to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Logit-score pruning to a fixed maximum size yields comparable retrieval cost across different vocabulary sizes and initializations.
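Whether this assumption holds can be checked with simple cost proxies. The sketch below computes the average number of non-zero terms per vector and the average number of posting entries an exhaustive traversal would scan per query. Both are rough proxies chosen for illustration: real engines with dynamic pruning (for example WAND or MaxScore) typically scan fewer postings, and the function names and toy data are hypothetical.

```python
from collections import Counter

def avg_nonzeros(vectors):
    """Mean number of non-zero terms per sparse vector (an L0 average)."""
    if not vectors:
        return 0.0
    counts = [sum(1 for w in v.values() if w > 0) for v in vectors]
    return sum(counts) / len(vectors)

def postings_touched(queries, docs):
    """Average posting entries scanned per query under exhaustive traversal."""
    # Document frequency of a term equals the length of its posting list.
    df = Counter()
    for d in docs:
        df.update(t for t, w in d.items() if w > 0)
    total = sum(df[t] for q in queries for t, w in q.items() if w > 0)
    return total / len(queries)

# Toy collection: term id -> weight (values are made up).
docs = [{1: 0.5, 2: 1.0}, {2: 0.3, 3: 0.7}, {1: 0.2}]
queries = [{2: 1.0}, {1: 0.4, 3: 0.1}]
doc_l0 = avg_nonzeros(docs)
query_l0 = avg_nonzeros(queries)
cost = postings_touched(queries, docs)
```

Two models pruned to the same cap can have equal `avg_nonzeros` yet different `postings_touched`, because a larger vocabulary spreads documents over more, shorter posting lists.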
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
we applied logit score-based queries and documents pruning to max size... ESPLADE models are more effective than the random vocab model, while having a similar retrieval cost
-
IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
vocab size... plays the role of configuring the representational specification
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
From Tokens to Concepts: Leveraging SAE for SPLADE
SAE-SPLADE substitutes SPLADE's backbone vocabulary with SAE-derived semantic concepts and matches standard SPLADE performance with better efficiency on in- and out-of-domain tasks.
Reference graph
Works this paper leans on
-
[1]
Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. Sparterm: Learning term-based sparse representation for fast text retrieval. arXiv preprint arXiv:2010.00768.
-
[2]
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
-
[3]
Zhuyun Dai and Jamie Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687.
-
[4]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.
-
[5]
Language-agnostic BERT sentence embedding, 2022
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic bert sentence embedding. arXiv preprint arXiv:2007.01852.
-
[6]
Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. Splade v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086.
-
[7]
Splate: Sparse late interaction retrieval
Thibault Formal, Stéphane Clinchant, Hervé Déjean, and Carlos Lassance. Splate: Sparse late interaction retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2635–2640, 2024a. Thibault Formal, Carlo...
-
[8]
Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666.
-
[9]
In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval
Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 163–173, 2021.
-
[10]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
-
[11]
Joel Mackenzie, Andrew Trotman, and Jimmy Lin. Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation. arXiv preprint arXiv:2110.11540.
-
[12]
Exploring the representation power of splade models
Joel Mackenzie, Shengyao Zhuang, and Guido Zuccon. Exploring the representation power of splade models. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pages 143–147, 2023.
-
[13]
Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with bert. arXiv preprint arXiv:1901.04085.
-
[14]
Document expansion by query prediction
Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query prediction. arXiv preprint arXiv:1904.08375.
-
[15]
Minimizing flops to learn efficient sparse representations
Biswajit Paria, Chih-Kuan Yeh, Ian EH Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. Minimizing flops to learn efficient sparse representations. arXiv preprint arXiv:2004.05665.
-
[16]
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191.
-
[17]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
-
[18]
Within-document term-based index pruning with statistical hypothesis testing
Sree Lekha Thota and Ben Carterette. Within-document term-based index pruning with statistical hypothesis testing. In Advances in Information Retrieval: 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011.
-
[19]
Approximate nearest neighbor negative contrastive learning for dense text retrieval
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808.
-
[20]
Jheng-Hong Yang, Xueguang Ma, and Jimmy Lin. Sparsifying sparse representations for passage retrieval by top-k masking. arXiv preprint arXiv:2112.09628.
-
[21]
Transfer Learning for Low-Resource Neural Machine Translation
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.
-
[22]
19 Kim, Lee and Won
model topk topk validset q K d K L q L d L j dual loss mask KL L0 q L0 d MRR@10 R@10 R@100
bm25 . . . . . . . . . 0.9427 0.7071 0.8112
rand-100k-ts *1 *2 1000 2000 . . 5 X X 10.81 26.92 0.9285 0.7182 0.8663
splade-32K-ts-0.1m-e1 . . 5 0.2 . X X 9.67 31.03 0.947 0.7298 0.8648
splade-32K-ts-0.1m-e2 . . . . 5 X X 8.05 20.31 0.9418 0.7262 ...
-
[23]
The q K and d K are top-k masking from Yang et al
All models, unless otherwise noted, follow these training configurations: a batch size of 7168, a learning rate of 1e-4, 16-bit (mixed) precision training, and 100K training steps. The q K and d K are the top-k masking values from Yang et al. (2021). The L q and L d are the FLOPS regularizer weights. The L j is the joint FLOPS regularizer weight. The dual...