The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles

Hiun Kim; Tae Kwan Lee; Taeryun Won

arxiv: 2605.01407 · v1 · submitted 2026-05-02 · 💻 cs.IR · cs.CL

Pith reviewed 2026-05-09 18:14 UTC · model grok-4.3

classification 💻 cs.IR cs.CL

keywords pre-training · SPLADE · information retrieval · masked language modeling · sparse vectors · fine-tuning · pruning · web search

The pith

Pre-training Expanded-SPLADE models on general text with higher learning rates produces stronger retrieval results after fine-tuning on web titles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how choices in pre-training corpus and learning rate affect Expanded-SPLADE models when they are later fine-tuned for retrieval on web document titles. Models pre-trained on broad general corpora at higher learning rates achieve better retrieval effectiveness both without pruning and under the strictest pruning of sparse vectors, even though they record lower accuracy on the masked language modeling objective. These same models display higher retrieval costs and greater variation in the lengths of individual term postings lists once pruning becomes severe. Repeating the general pre-training data shows almost no additional benefit to final retrieval performance. The findings indicate that standard pre-training practices may not fully align with the demands of sparse retrieval fine-tuning.

Core claim

Fine-tuned models of higher retrieval effectiveness at both unpruned and most strict pruned settings are mostly pre-trained on a general corpus, and pre-trained with a higher learning rate, showing lower MLM accuracies. In the most strict pruned setting, those models show higher-level retrieval cost and a higher variance in the length of the individual postings list. The repetition of the general pre-training dataset does not have much effect on retrieval effectiveness.
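The two efficiency quantities named in this claim, retrieval cost and the variance of individual postings-list lengths, can be made concrete with a toy inverted index. The sketch below is illustrative only: the paper's exact cost definition is not given on this page, so `retrieval_cost` assumes cost is the number of postings touched by a query, and all function names and the example vectors are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def build_postings(doc_vectors):
    """Build an inverted index from sparse document vectors.

    doc_vectors: list of {term_id: weight} dicts, one per document.
    Returns {term_id: [(doc_id, weight), ...]}.
    """
    postings = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term_id, weight in vec.items():
            if weight > 0:
                postings[term_id].append((doc_id, weight))
    return postings


def postings_length_variance(postings):
    """Population variance of the lengths of the individual postings lists."""
    return pvariance([len(plist) for plist in postings.values()])


def retrieval_cost(query_vec, postings):
    """Assumed cost proxy: total postings scanned for the query's active terms."""
    return sum(len(postings.get(t, ())) for t, w in query_vec.items() if w > 0)


# Hypothetical toy data: term 0 is a "hub" term present in every document.
docs = [{0: 1.0, 1: 0.5}, {0: 0.8, 2: 0.3}, {0: 0.2}]
index = build_postings(docs)
variance = postings_length_variance(index)    # list lengths are 3, 1, 1
cost = retrieval_cost({0: 1.0, 2: 0.5}, index)  # scans 3 + 1 postings
```

Under this proxy, a model that concentrates weight on a few frequent terms produces long postings lists for those terms, which raises both the variance and the cost of any query that activates them.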

What carries the argument

Expanded-SPLADE (ESPLADE) models that reuse the masked language modeling layer at fine-tuning time to generate sparse vector representations for retrieval.
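As a rough illustration of how an MLM head can yield a sparse vector, the sketch below follows the widely used SPLADE recipe: apply a log-saturated ReLU to the per-token vocabulary logits and max-pool over token positions. This is an assumption about the general SPLADE family, not a reproduction of the ESPLADE architecture studied in the paper; the function name and toy logits are hypothetical.

```python
import numpy as np


def splade_sparse_vector(mlm_logits: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool MLM logits into one vocabulary-sized sparse vector.

    mlm_logits: array of shape (seq_len, vocab_size) from the MLM head.
    attention_mask: array of shape (seq_len,), 1 for real tokens, 0 for padding.
    """
    # log(1 + ReLU(x)) zeroes negative logits and damps very large ones.
    activations = np.log1p(np.maximum(mlm_logits, 0.0))
    # Padding positions contribute nothing (activations are non-negative).
    activations = activations * attention_mask[:, None]
    # Max over token positions gives one weight per vocabulary term.
    return activations.max(axis=0)


# Toy example: 3 positions (the last one is padding), vocabulary of 4 terms.
logits = np.array([
    [2.0, -1.0, 0.5, -3.0],
    [-0.5, 1.0, 0.0, -2.0],
    [5.0, 5.0, 5.0, 5.0],
])
mask = np.array([1, 1, 0])
vec = splade_sparse_vector(logits, mask)
```

Because negative logits map to exactly zero, the distribution of the pre-trained MLM logits directly shapes how many terms are active before any fine-tuning, which is one reason pre-training choices can matter for this model family.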

Load-bearing premise

Differences in retrieval effectiveness and cost arise mainly from the choice of pre-training corpus and learning rate rather than from fine-tuning details, model size, or traits of the web document titles dataset.

What would settle it

A controlled experiment that pre-trains otherwise identical models on the same or different corpora while varying only the learning rate, then applies identical fine-tuning and evaluates retrieval metrics plus cost across pruning levels to check whether the reported effectiveness patterns hold.
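One way to lay out such a controlled comparison is to enumerate configuration pairs that differ only in the learning-rate level. The sketch below is a hypothetical planning helper, not the authors' tooling: the naming pattern mirrors model names such as emlm-ptd-indep-repeat-lr-h that appear on this page, while the field values, the seed list, and the class itself are assumptions.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class PretrainConfig:
    corpus: str       # assumed values: "indep" or "overlap"
    repetition: str   # assumed values: "repeat" or "uniq"
    lr_level: str     # "l" (lower) or "h" (higher)
    seed: int = 0

    @property
    def name(self) -> str:
        return f"emlm-ptd-{self.corpus}-{self.repetition}-lr-{self.lr_level}"


def lr_only_pairs(corpora, repetitions, seeds):
    """Yield (low, high) pairs that are identical except for lr_level."""
    for corpus, repetition, seed in product(corpora, repetitions, seeds):
        low = PretrainConfig(corpus, repetition, "l", seed)
        yield low, replace(low, lr_level="h")


pairs = list(lr_only_pairs(["indep", "overlap"], ["repeat", "uniq"], seeds=[0, 1, 2]))
```

Each pair would then go through the same fine-tuning recipe and the same pruning levels, so that any difference within a pair can be attributed to the learning rate alone.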

Figures

Figures reproduced from arXiv: 2605.01407 by Hiun Kim, Tae Kwan Lee, Taeryun Won.

**Figure 1.** Result on Evaluation Set of Fine-tuned Models with Different Pre-trained EMLM Models. Models are trained on trainset (see Appendix C.1 for the details), with top-k masking (Yang et al., 2021) of q_K=1000, d_K=2000. Q, D pruning (see Appendix D for the detailed explanation) is applied for the rightmost node to the leftmost node (qk=0, dk=0; not pruned), (qk=7, dk=20), (qk=5, dk=20), (qk=5, dk=10), respective…

**Figure 2.** Accuracies of EMLM Pre-training. The logit-score-std value of each model is associated with the model name in the legend. The value is also used to color each line of the graph. … effectiveness and efficiency in a strict pruning setting. The variance in the length of the individual postings list originated from the differences in pre-training can be related. 4.2 Retrieval Effectiveness and Pre-training Accur…

**Figure 3.** Losses of EMLM Pre-training. The logit-score-std value of each model is associated with the model name in the legend. The value is also used to color each line of the graph. … fine-tuned are emlm-ptd-indep-repeat-lr-h and emlm-ptd-indep-uniq-lr-h, which correspond to the indep-repeat-lr-h and indep-uniq-lr-h, respectively. These models have a higher EMLM logit-score-std score and show lower accuracies. The …

**Figure 4.** Losses & Accuracies of EMLM Pre-training on Longer Steps. The logit-score-std value of the model is associated with the model name in the legend. Losses & accuracies of EMLM pre-training on longer steps are depicted in …
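The Figure 1 caption refers to test-time Q, D pruning with settings such as (qk=7, dk=20). The paper's actual procedure is described in its Appendix D, which is not reproduced on this page, so the sketch below simply assumes that pruning keeps the k highest-weighted terms of a sparse vector and that k=0 means "not pruned", as the caption's (qk=0, dk=0) entry suggests. The function name and example vector are hypothetical.

```python
import numpy as np


def prune_top_k(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest weights of a non-negative sparse vector.

    k <= 0 is treated as "no pruning", following the caption's convention.
    """
    if k <= 0 or k >= np.count_nonzero(vec):
        return vec.copy()
    # Indices of the k largest entries (order among them is not significant).
    keep = np.argpartition(vec, -k)[-k:]
    pruned = np.zeros_like(vec)
    pruned[keep] = vec[keep]
    return pruned


doc_vec = np.array([0.0, 1.2, 0.3, 2.5, 0.7])
doc_pruned = prune_top_k(doc_vec, k=2)    # keeps the weights 2.5 and 1.2
doc_unpruned = prune_top_k(doc_vec, k=0)  # unchanged copy
```

In an evaluation loop, the query side would use qk and the document side dk, so stricter settings shrink both the index and the number of query terms that can match.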

read the original abstract

Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness and transfer learning issues for fine-tuning them into Neural Bi-Encoder models. This paper studies the effect of different pre-training datasets and pre-training options on the MLM pre-trained models for retrieval fine-tuning. The study focuses on the SPLADE-style model, which uses the MLM layer also at fine-tuning time. More specifically, we experimented with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, and in-house web document titles are used as datasets. Pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors are conducted. Our observations are three-fold: First, fine-tuned models of higher retrieval effectiveness at both unpruned and most strict pruned settings are mostly pre-trained on a general corpus, and pre-trained with a higher learning rate, showing lower MLM accuracies. Second, in the most strict pruned setting, those models show higher-level retrieval cost and a higher variance in the length of the individual postings list. Third, the repetition of the general pre-training dataset does not have much effect on retrieval effectiveness. The experimentation empirically identifies the potential limitations for aligning MLM pre-training to ESPLADE fine-tuning. Also, the experimentation provides an empirical observation that, at most strict pruned settings, the retrieval effectiveness is better maintained by the higher-level retrieval cost, showing the trade-off relationship between the two in our setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This ablation finds that general-corpus pre-training plus a higher learning rate helps Expanded-SPLADE retrieval on web titles, but the attribution is weakened by uncontrolled fine-tuning factors and thin reporting.

read the letter

The key points are that fine-tuned ESPLADE models pre-trained on a general corpus with a higher learning rate tend to show better retrieval effectiveness both without pruning and at the strictest pruning levels, even with lower MLM accuracy, while also incurring higher retrieval cost and more variable postings-list lengths in the pruned case. Repeating the general pre-training data adds little. These are the main empirical observations from the abstract and summary.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study examining the impact of pre-training corpus choice (general vs. in-house web document titles) and learning rate on Expanded-SPLADE (ESPLADE) models. After MLM pre-training and retrieval fine-tuning, the authors report that higher-effectiveness models at both unpruned and strictly pruned settings are predominantly those pre-trained on general corpora with higher learning rates (despite lower MLM accuracy). In the strictest pruning regime these models also exhibit higher retrieval cost and greater variance in individual postings-list lengths. A third observation is that repeating the general pre-training data has negligible effect on downstream effectiveness. The work highlights potential misalignment between MLM pre-training and ESPLADE fine-tuning objectives together with an effectiveness-cost trade-off under aggressive pruning.

Significance. If the reported patterns survive proper controls and statistical validation, the observations would supply useful empirical guidance for initializing sparse retrieval models and would underscore the imperfect transfer from MLM pre-training to fine-tuned bi-encoder performance. The explicit documentation of the effectiveness-cost trade-off at strict pruning levels is a concrete, actionable finding. At present the evidential basis is too preliminary for strong claims about causality or generalizability.

major comments (3)

[Abstract] Abstract: the central claim that 'fine-tuned models of higher retrieval effectiveness ... are mostly pre-trained on a general corpus, and pre-trained with a higher learning rate' is presented without any quantitative support (number of configurations tested, exact proportions, or selection criteria), rendering the qualifier 'mostly' unverifiable and load-bearing for the paper's main conclusion.
[Experimental description] Experimental description (as summarized in the abstract and methods outline): no statement confirms that fine-tuning hyperparameters (learning rate, epochs, batch size, optimizer, random seeds) were held constant across the different pre-training corpora and learning-rate conditions. Without such controls the attribution of effectiveness differences to pre-training corpus and LR cannot be isolated from confounding factors.
[Results] Results and observations: the three-fold findings are reported as directional observations without error bars, multiple independent runs, or any statistical significance tests. This absence directly undermines the reliability of the reported differences in retrieval effectiveness, cost, and postings-list variance, especially under the 'most strict pruned setting'.
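The kind of statistical check requested in the third comment could look like a paired bootstrap over per-query scores of two fine-tuned models. This is a generic sketch, not something the paper reports: the function name, the choice of a 95% percentile interval, and the example scores are all assumptions for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Percentile bootstrap for the mean per-query difference (a - b).

    scores_a, scores_b: per-query metric values for two systems on the
    same queries, in the same order.
    Returns (mean_diff, (ci_low, ci_high)) with an approximate 95% interval.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired per query")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, (ci_low, ci_high)


# Hypothetical per-query scores for two models on five shared queries.
model_a = [0.52, 0.61, 0.70, 0.48, 0.66]
model_b = [0.50, 0.55, 0.71, 0.40, 0.60]
mean_diff, (ci_low, ci_high) = paired_bootstrap(model_a, model_b, n_resamples=1000)
```

If the interval excludes zero, the difference is unlikely to be due to query sampling alone; it does not address variance across training seeds, which would still need repeated runs.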

minor comments (2)

[Abstract] The abstract introduces 'Expanded-SPLADE (ESPLADE)' and 'SPLADE-style model' without a concise definition or citation to the original SPLADE work, which would aid readers unfamiliar with the architecture.
[Abstract] Quantitative descriptors such as 'higher-level retrieval cost' and 'higher variance in the length of the individual postings list' would benefit from explicit units or relative percentages to make the trade-off observation more precise.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our empirical study of Expanded-SPLADE pre-training. We address each major comment below, indicating where revisions will be made to improve clarity, controls, and evidential support.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'fine-tuned models of higher retrieval effectiveness ... are mostly pre-trained on a general corpus, and pre-trained with a higher learning rate' is presented without any quantitative support (number of configurations tested, exact proportions, or selection criteria), rendering the qualifier 'mostly' unverifiable and load-bearing for the paper's main conclusion.

Authors: We agree that the abstract would be strengthened by explicit quantitative support for the claim. The experiments evaluated a range of pre-training configurations combining corpus choice and learning rates, with the results section detailing which yielded higher effectiveness. In the revision we will update the abstract to state the total number of configurations tested and the proportion (or exact count) that align with the reported pattern, replacing the qualifier 'mostly' with a precise description such as 'the majority of' or a specific fraction. revision: yes

Referee: [Experimental description] Experimental description (as summarized in the abstract and methods outline): no statement confirms that fine-tuning hyperparameters (learning rate, epochs, batch size, optimizer, random seeds) were held constant across the different pre-training corpora and learning-rate conditions. Without such controls the attribution of effectiveness differences to pre-training corpus and LR cannot be isolated from confounding factors.

Authors: Fine-tuning hyperparameters were held constant across all pre-training conditions to isolate the effects of corpus and learning rate. The same values were used for learning rate, number of epochs, batch size, optimizer, and random seeds in every fine-tuning run. We will add an explicit paragraph in the revised Experimental Setup section stating these fixed hyperparameters and confirming they were identical for all models. revision: yes

Referee: [Results] Results and observations: the three-fold findings are reported as directional observations without error bars, multiple independent runs, or any statistical significance tests. This absence directly undermines the reliability of the reported differences in retrieval effectiveness, cost, and postings-list variance, especially under the 'most strict pruned setting'.

Authors: We acknowledge that the results are presented without error bars or formal statistical tests. Each configuration was run once due to the substantial computational cost of pre-training on large corpora. The directional patterns were consistent across pruning regimes and multiple retrieval metrics. In the revision we will add a limitations paragraph noting the single-run design and the absence of statistical significance testing, while emphasizing the consistency of trends; we will also explore adding limited additional runs with varied seeds for the key configurations if resources permit. revision: partial

Circularity Check

0 steps flagged

Purely empirical study with no derivation chain or self-referential predictions

full rationale

This paper is an empirical investigation of pre-training effects on Expanded-SPLADE models using in-house web document titles. It reports three observations drawn from experimental comparisons of pre-training corpora, learning rates, and dataset repetition, with no equations, first-principles derivations, or predictions that reduce to fitted parameters by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claims rest on direct experimental outcomes rather than any closed logical loop, making the study self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard machine-learning assumptions about transfer from masked language modeling to retrieval tasks and the representativeness of the chosen web titles data; no new entities are postulated.

free parameters (1)

pre-training learning rate
Higher value is associated with better downstream retrieval but lower MLM accuracy; chosen as an experimental variable.

axioms (1)

domain assumption: Masked language modeling pre-training supplies a useful initialization for SPLADE-style retrieval fine-tuning.
Explicitly stated as the motivation and setup for the experiments in the abstract.

pith-pipeline@v0.9.0 · 5596 in / 1382 out tokens · 88710 ms · 2026-05-09T18:14:40.809109+00:00 · methodology

