arxiv: 2509.16621 · v2 · submitted 2025-09-20 · 💻 cs.IR · cs.CL

The Role of Vocabularies in Learning Sparse Representations for Ranking

Pith reviewed 2026-05-18 15:51 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords learned sparse retrieval · SPLADE · vocabulary size · pruning · ESPLADE · ranking efficiency · sparse representations

The pith

Larger output vocabularies in SPLADE models, after logit pruning, deliver effectiveness comparable to standard models at BM25-level computational costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how the size and initialization of the output vocabulary affect the sparse representations produced by learned sparse retrieval models such as SPLADE. The authors build two BERT-based models with 100K-term output vocabularies, one initialized with ESPLADE pretraining and one initialized randomly, fine-tune both on search click logs, and prune query and document representations to a maximum size by logit score. On the authors' evaluation set, the pruned models perform at least as well as the usual 32K-vocabulary SPLADE within a BM25-level computational budget, and the pretrained model edges out the random one at similar retrieval cost. This suggests that vocabulary choices shape how queries and documents interact in the retrieval system, beyond their usual role in NLP. The findings point to vocabulary configuration as a lever for both efficiency and effectiveness in sparse ranking.

Core claim

The size and pretrained weight of output vocabularies configure the representational specification for queries, documents, and their interactions in the retrieval engine, allowing 100K models with pruning to match or exceed the effectiveness of 32K SPLADE models within BM25 computational budgets, and ESPLADE initialization to outperform random vocab initialization at comparable retrieval cost.

What carries the argument

Output vocabulary in the BERT-based SPLADE model, expanded to 100K terms with either ESPLADE pretraining or random initialization, followed by logit-score pruning to a fixed maximum size for balancing efficiency.
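The pruning step described above can be sketched as a top-k selection over term weights. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `prune_top_k`, the toy term weights, and the convention that a non-positive `max_terms` disables pruning are all choices made for this sketch.

```python
import heapq
from typing import Dict


def prune_top_k(term_weights: Dict[str, float], max_terms: int) -> Dict[str, float]:
    """Keep only the max_terms highest-weighted terms of a sparse vector.

    Assumption for this sketch: max_terms <= 0 means "do not prune".
    """
    if max_terms <= 0 or len(term_weights) <= max_terms:
        return dict(term_weights)
    top = heapq.nlargest(max_terms, term_weights.items(), key=lambda kv: kv[1])
    return dict(top)


# Toy document representation (hypothetical terms and weights).
doc = {"laptop": 2.1, "notebook": 1.7, "computer": 1.2, "bag": 0.3, "sale": 0.1}
pruned = prune_top_k(doc, max_terms=3)
print(pruned)
```

Queries and documents would each get their own `max_terms` value, since the paper prunes both to separate maximum sizes.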

If this is right

  • The pruned 100K models achieve effectiveness comparable to or better than the 32K SPLADE under the same computational budget as BM25.
  • ESPLADE-pretrained models outperform randomly initialized vocab models while maintaining similar retrieval costs.
  • Vocabulary size and pretraining provide a way to configure representational specifications for more efficient and effective LSR.
  • These configurations extend beyond standard NLP meanings to impact query-document matching in retrieval engines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If larger vocabularies work better after pruning, future work could test even bigger sizes or alternative pruning methods to further optimize the trade-off.
  • The role of vocabulary might connect to how semantic granularity affects sparse matching across different retrieval datasets.
  • Practitioners could experiment with custom vocabulary initializations tailored to specific search domains for improved performance.

Load-bearing premise

The assumption that pruning model outputs to a fixed maximum size based on scores yields fair efficiency-effectiveness comparisons across different vocabulary sizes and starting points, without biases tied to the specific data or cut-off choices.

What would settle it

Running the same experiments on a different search dataset or with a varied pruning threshold and finding that the 100K models no longer match the 32K SPLADE effectiveness within the BM25 cost budget would falsify the claim.
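A threshold sweep of the kind suggested above could start from something like the following sketch. It reports only the mean number of non-zero terms left after pruning at several maximum sizes. The helper names and the toy vectors are hypothetical, and a real check would also recompute effectiveness at each threshold.

```python
from typing import Dict, List


def top_k(vec: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k highest-weighted terms of a sparse vector."""
    ranked = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def mean_nonzeros(vectors: List[Dict[str, float]], k: int) -> float:
    """Average number of terms that survive pruning to at most k terms."""
    return sum(len(top_k(v, k)) for v in vectors) / len(vectors)


# Toy collection (hypothetical); real vectors would come from the trained models.
docs = [
    {"a": 1.0, "b": 0.5, "c": 0.2, "d": 0.1},
    {"a": 0.9, "e": 0.4},
    {"f": 1.3, "g": 0.8, "h": 0.7},
]

sweep = {k: mean_nonzeros(docs, k) for k in (1, 2, 4)}
print(sweep)
```

The point of the sweep is that a fixed maximum size is only an upper bound: the effective sparsity at the same threshold can differ between models, which is exactly what a fair comparison has to report.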

Figures

Figures reproduced from arXiv: 2509.16621 by Hiun Kim, Tae Kwan Lee, Taeryun Won.

Figure 1
Figure 1: Results on the evaluation set. Models are trained on the training set. The splade-32K model (32K output vocabulary) is trained with top-k masking at q K=500, d K=1000; the other models (100K output vocabulary) are trained with top-k masking at q K=1000, d K=2000. The qk and dk values are the maximum numbers of terms used to represent each query Q and document D, keeping the terms with the top-k highest MLM logit scores (value of 0 means unpr…
Original abstract

Learned Sparse Retrieval (LSR) such as SPLADE has attracted growing interest for effective semantic first-stage matching while enjoying the efficiency of inverted indices. A recent work on learning SPLADE models with expanded vocabularies (ESPLADE) proposed representing queries and documents in a sparse space of a custom vocabulary with different levels of vocabularic granularity. Within this effort, however, there have not been many studies on the role of vocabulary in SPLADE models and its relationship to retrieval efficiency and effectiveness. To study this, we construct BERT models with 100K-sized output vocabularies, one initialized with the ESPLADE pretraining method and one initialized randomly. After fine-tuning on real-world search click logs, we apply logit score-based queries and documents pruning to max size for further balancing efficiency. The experimental results on our evaluation set show that, when pruning is applied, the two models are effective compared to the 32K-sized normal SPLADE model in the computational budget under the BM25. The ESPLADE models are more effective than the random vocab model, while having a similar retrieval cost. The results indicate that the size and pretrained weight of output vocabularies play the role of configuring the representational specification for queries, documents, and their interactions in the retrieval engine, beyond their original meaning and purposes in NLP. These findings open new room for improvement in LSR by identifying the importance of the representational specification that comes from vocabulary configuration for efficient and effective retrieval.
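For readers new to SPLADE, the sketch below shows how a sparse vector over the output vocabulary can be derived from MLM logits and then scored with a dot product. It uses the max-pooling formulation from SPLADE v2 (Formal et al.), weight_j = max over tokens of log(1 + ReLU(logit_ij)). Whether the paper uses exactly this pooling is an assumption, and the four-term vocabulary and logits are invented for illustration.

```python
import math
from typing import Dict, List


def splade_pool(token_logits: List[List[float]]) -> List[float]:
    """Per vocabulary entry: max over input tokens of log(1 + ReLU(logit))."""
    vocab_size = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab_size)
    ]


def to_sparse(weights: List[float], vocab: List[str]) -> Dict[str, float]:
    """Drop zero weights so only active vocabulary terms are stored."""
    return {vocab[j]: w for j, w in enumerate(weights) if w > 0.0}


def dot_score(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Relevance as a dot product over the terms shared by query and document."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


# Hypothetical 4-term output vocabulary and per-token MLM logits.
vocab = ["shoe", "boot", "red", "sale"]
query_logits = [[1.0, -2.0, 0.0, 3.0], [0.5, -1.0, 0.0, 1.0]]
doc_logits = [[2.0, 1.0, -0.5, 0.0]]

query_vec = to_sparse(splade_pool(query_logits), vocab)
doc_vec = to_sparse(splade_pool(doc_logits), vocab)
relevance = dot_score(query_vec, doc_vec)
print(query_vec, doc_vec, relevance)
```

Because only shared non-zero terms contribute to the score, the output vocabulary directly determines which query-document interactions are possible, which is the lever the paper studies.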

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies the role of output vocabulary size and initialization in learned sparse retrieval (LSR) models such as SPLADE. The authors construct 100K-vocabulary BERT models—one initialized via ESPLADE pretraining and one randomly—fine-tune both on search click logs, and apply logit-score-based pruning of queries and documents to a maximum size. They claim that the resulting pruned 100K models outperform a standard 32K SPLADE baseline in effectiveness under a BM25 computational budget, with the pretrained ESPLADE variant further outperforming the random-vocabulary model at comparable retrieval cost. The authors conclude that vocabulary size and pretraining configure representational specifications for queries, documents, and their interactions beyond conventional NLP usage.

Significance. If the empirical comparisons hold after addressing the noted gaps, the work would demonstrate that vocabulary configuration can be used as a controllable lever to improve the efficiency-effectiveness trade-off in LSR without increasing retrieval cost, thereby identifying a new design axis for sparse first-stage rankers that complements existing pruning and regularization techniques.

major comments (2)
  1. [Experimental setup and results (pruning description)] The central claim (abstract and experimental results) that pruned 100K models outperform the 32K SPLADE baseline under a BM25 budget rests on the assumption that logit-score pruning to a fixed maximum size is neutral with respect to vocabulary size. However, the larger candidate pool in the 100K models could systematically select higher-quality or more diverse terms at the same numeric max-size threshold, altering effective sparsity and posting-list costs; no ablation across pruning thresholds, no reporting of post-pruning average non-zeros per query/document, and no comparison of term-quality distributions are provided to rule out this interaction.
  2. [Abstract and §4 (results)] The abstract and results section state that the 100K models are 'effective compared to the 32K-sized normal SPLADE model' and that 'ESPLADE models are more effective than the random vocab model,' yet supply no concrete effectiveness metrics (e.g., nDCG@10, Recall@1000), no confidence intervals, and no table of raw scores or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported gains.
minor comments (2)
  1. [Method] The notation for the pruning procedure ('logit score-based queries and documents pruning to max size') is introduced without a formal definition or pseudocode; adding an equation or algorithm box would clarify how the maximum size is enforced separately for queries versus documents.
  2. [Experimental setup] The manuscript refers to 'the evaluation set' without specifying the dataset name, split, or query count; explicit citation of the test collection (e.g., MS MARCO or a proprietary log) is needed for reproducibility.
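One possible formalization of the pruning step requested in the first minor comment is sketched below. The symbols are assumptions introduced here, not the authors' notation: $w \in \mathbb{R}_{\ge 0}^{|V|}$ is a sparse representation over the output vocabulary $V$, $\operatorname{top}_k(w)$ is the index set of its $k$ largest entries, and $k_q$, $k_d$ are the maximum sizes for queries and documents.

```latex
\[
\operatorname{prune}_k(w)_j =
\begin{cases}
  w_j & \text{if } j \in \operatorname{top}_k(w),\\
  0   & \text{otherwise,}
\end{cases}
\qquad
\hat{q} = \operatorname{prune}_{k_q}(q), \quad
\hat{d} = \operatorname{prune}_{k_d}(d),
\]
\[
s(q, d) = \sum_{j=1}^{|V|} \hat{q}_j \, \hat{d}_j .
\]
```

Under this reading, $k_q$ and $k_d$ bound the number of non-zero entries from above but do not fix it, so the actual post-pruning sparsity still has to be measured.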

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in our manuscript on the role of vocabularies in SPLADE models. We address each major comment in detail below, indicating the revisions we plan to make.

  1. Referee: [Experimental setup and results (pruning description)] The central claim (abstract and experimental results) that pruned 100K models outperform the 32K SPLADE baseline under a BM25 budget rests on the assumption that logit-score pruning to a fixed maximum size is neutral with respect to vocabulary size. However, the larger candidate pool in the 100K models could systematically select higher-quality or more diverse terms at the same numeric max-size threshold, altering effective sparsity and posting-list costs; no ablation across pruning thresholds, no reporting of post-pruning average non-zeros per query/document, and no comparison of term-quality distributions are provided to rule out this interaction.

    Authors: We appreciate this insightful observation regarding the potential non-neutrality of the pruning process across different vocabulary sizes. To strengthen the manuscript, we will revise Section 4 to include the average number of non-zero terms for queries and documents after pruning for all compared models. We will also add an ablation study examining performance across a range of pruning thresholds (maximum sizes) to demonstrate that the effectiveness gains of the 100K models hold consistently. While a detailed comparison of term-quality distributions would require additional analysis beyond the current scope, the controlled computational budget under BM25 and the superior effectiveness metrics support that the larger vocabulary enables better representational choices. We believe these additions will adequately address the concern.

    revision: partial

  2. Referee: [Abstract and §4 (results)] The abstract and results section state that the 100K models are 'effective compared to the 32K-sized normal SPLADE model' and that 'ESPLADE models are more effective than the random vocab model,' yet supply no concrete effectiveness metrics (e.g., nDCG@10, Recall@1000), no confidence intervals, and no table of raw scores or statistical significance tests, making it impossible to assess the magnitude or reliability of the reported gains.

    Authors: We acknowledge the need for more quantitative details in the abstract and results presentation. In the revised manuscript, we will expand the abstract to reference specific improvements and include a comprehensive table in Section 4 with raw effectiveness scores (nDCG@10, Recall@1000, etc.) for the baseline and proposed models, both pre- and post-pruning. Confidence intervals will be reported, and we will conduct and report statistical significance tests to validate the differences. This will allow readers to better evaluate the magnitude and reliability of the gains.

    revision: yes
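The effectiveness metrics named in this exchange have standard definitions. The sketch below gives them for a single query with binary relevance, as a reference for what such a results table would contain. The function names, document IDs, and relevance judgments are hypothetical, and a real evaluation would average over all queries.

```python
import math
from typing import List, Set


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents retrieved within the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical ranking and judgments for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

mrr = mrr_at_k(ranked, relevant)
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant)
print(mrr, recall, ndcg)
```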

Circularity Check

0 steps flagged

No significant circularity; results are empirical and self-contained

The paper reports an empirical study: BERT models are built with 100K output vocabularies (one ESPLADE-pretrained, one random), fine-tuned on click logs, then pruned by logit scores to a fixed maximum size before comparing effectiveness and retrieval cost against a 32K SPLADE baseline under BM25 budget. All claims rest on these direct experimental measurements of effectiveness and efficiency; the provided text contains no equations, derivations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled through prior work. The derivation chain is therefore absent, and the results do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the untested premise that the chosen pruning strategy and click-log finetuning produce unbiased efficiency comparisons across vocabulary configurations; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Logit-score pruning to a fixed maximum size yields comparable retrieval cost across different vocabulary sizes and initializations.
    Invoked when the authors state that the pruned models operate under the same computational budget as BM25.
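One way to probe this assumption is to measure a retrieval-cost proxy directly instead of inferring it from the pruning threshold. The sketch below counts the posting entries a query would touch in an inverted index. This proxy, the function names, and the toy data are assumptions made for illustration and are not the paper's cost measure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Index = Dict[str, List[Tuple[int, float]]]


def build_index(docs: List[Dict[str, float]]) -> Index:
    """Map each term to its posting list of (doc_id, weight) pairs."""
    index: Index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def postings_touched(query: Dict[str, float], index: Index) -> int:
    """Cost proxy: total length of the posting lists of all query terms."""
    return sum(len(index.get(term, [])) for term in query)


# Toy pruned representations (hypothetical terms and weights).
docs = [
    {"a": 1.0, "b": 0.5},
    {"a": 0.7, "c": 0.2},
    {"b": 0.9},
]
query = {"a": 1.2, "c": 0.4, "z": 0.3}

index = build_index(docs)
cost = postings_touched(query, index)
print(cost)
```

Two models pruned to the same maximum size can still differ on this count, because a larger vocabulary tends to spread documents over more and shorter posting lists.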

pith-pipeline@v0.9.0 · 5801 in / 1404 out tokens · 43666 ms · 2026-05-18T15:51:34.903270+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Tokens to Concepts: Leveraging SAE for SPLADE

    cs.IR 2026-04 unverdicted novelty 6.0

    SAE-SPLADE substitutes SPLADE's backbone vocabulary with SAE-derived semantic concepts and matches standard SPLADE performance with better efficiency on in- and out-of-domain tasks.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    SparTerm: Learning term-based sparse representation for fast text retrieval

    Yang Bai, Xiaoguang Li, Gang Wang, Chaoliang Zhang, Lifeng Shang, Jun Xu, Zhaowei Wang, Fangshan Wang, and Qun Liu. SparTerm: Learning term-based sparse representation for fast text retrieval. arXiv preprint arXiv:2010.00768.

  2. [2]

    Unsupervised Cross-lingual Representation Learning at Scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

  3. [3]

    Context-aware sentence/passage term importance estimation for first stage retrieval

    Zhuyun Dai and Jamie Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687.

  4. [4]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

  5. [5]

    Language-agnostic BERT sentence embedding, 2022

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

  6. [6]

    doi:10.48550/ARXIV.2109.10086

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086.

  7. [7]

    SPLATE: Sparse late interaction retrieval

    Thibault Formal, Stéphane Clinchant, Hervé Déjean, and Carlos Lassance. SPLATE: Sparse late interaction retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2635–2640, 2024a. Thibault Formal, Carlo...

  8. [8]

    Improving efficient neural ranking models with cross-architecture knowledge distillation

    Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. Improving efficient neural ranking models with cross-architecture knowledge distillation. arXiv preprint arXiv:2010.02666.

  9. [9]

    In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval

    Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 163–173.

  10. [10]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

  11. [11]

    Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation

    Joel Mackenzie, Andrew Trotman, and Jimmy Lin. Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation. arXiv preprint arXiv:2110.11540.

  12. [12]

    Exploring the representation power of SPLADE models

    Joel Mackenzie, Shengyao Zhuang, and Guido Zuccon. Exploring the representation power of SPLADE models. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, pages 143–147.

  13. [13]

    Passage Re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

  14. [14]

    Document expansion by query prediction

    Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query prediction. arXiv preprint arXiv:1904.08375.

  15. [15]

    Minimizing FLOPs to learn efficient sparse representations

    Biswajit Paria, Chih-Kuan Yeh, Ian EH Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. Minimizing FLOPs to learn efficient sparse representations. arXiv preprint arXiv:2004.05665.

  16. [16]

    RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering

    Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191.

  17. [17]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

  18. [18]

    Within-document term-based index pruning with statistical hypothesis testing

    Sree Lekha Thota and Ben Carterette. Within-document term-based index pruning with statistical hypothesis testing. In Advances in Information Retrieval: 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21.

  19. [19]

    Approximate nearest neighbor negative contrastive learning for dense text retrieval

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808.

  20. [20]

    Sparsifying sparse representations for passage retrieval by top-k masking

    Jheng-Hong Yang, Xueguang Ma, and Jimmy Lin. Sparsifying sparse representations for passage retrieval by top-k masking. arXiv preprint arXiv:2112.09628.

  21. [21]

    Transfer Learning for Low-Resource Neural Machine Translation

    Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.

  22. [22]

    0.9427 0.7071 0.8112 rand-100k-ts *1 *2 1000 2000

    model topk topk validset q K d K L q L d L j dual loss mask KL L0 q L0 d MRR@10 R@10 R@100
    bm25 . . . . . . . . . 0.9427 0.7071 0.8112
    rand-100k-ts *1 *2 1000 2000 . . 5 X X 10.81 26.92 0.9285 0.7182 0.8663
    splade-32K-ts-0.1m-e1 . . 5 0.2 . X X 9.67 31.03 0.947 0.7298 0.8648
    splade-32K-ts-0.1m-e2 . . . . 5 X X 8.05 20.31 0.9418 0.7262 ...

  23. [23]

    The q K and d K are top-k masking from Yang et al

    All models, unless otherwise noted, follow these training configurations: a batch size of 7168, a learning rate of 1e-4, 16-bit (mixed) precision training, and 100K training steps. The q K and d K are the top-k masking sizes from Yang et al. (2021). The L q and L d are the FLOPS regularizer weights. The L j is the joint FLOPS regularizer weight. The dual...