pith. machine review for the scientific record.

arxiv: 2604.21511 · v1 · submitted 2026-04-23 · 💻 cs.IR · cs.CL

Recognition: unknown

From Tokens to Concepts: Leveraging SAE for SPLADE

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 20:26 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords SPLADE · Sparse Auto-Encoders · learned sparse retrieval · semantic concepts · vocabulary replacement · retrieval efficiency · in-domain and out-of-domain tasks

The pith

Replacing the token vocabulary in SPLADE with semantic concepts from Sparse Auto-Encoders yields comparable retrieval performance with improved efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing the token-based vocabulary in SPLADE with a latent space of semantic concepts learned via Sparse Auto-Encoders. This addresses limitations of token vocabularies such as polysemy and synonymy while easing extension to multi-lingual and multi-modal settings. The authors examine compatibility between the two representations, test different training regimes, and compare the resulting SAE-SPLADE model against standard SPLADE. Experiments indicate that retrieval effectiveness holds steady on both in-domain and out-of-domain tasks while efficiency improves. A reader would care because the substitution points toward retrieval systems that depend less on fixed token sets.

Core claim

To solve the limitation of relying on the underlying backbone vocabulary, which might hinder performance due to polysemicity and synonymy and pose a challenge for multi-lingual and multi-modal usages, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

What carries the argument

SAE-learned latent semantic concepts used as a replacement for the original token vocabulary inside the SPLADE sparse retrieval architecture
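
The substitution is easiest to see in code. Below is a minimal numpy sketch of the idea (shapes, names, and pooling choices are editorial assumptions, not the authors' implementation): token hidden states pass through a TopK Sparse Auto-Encoder, and SPLADE-style log-saturated max pooling is taken over SAE concepts instead of the backbone token vocabulary.

```python
import numpy as np

def topk_sae_encode(hidden, W_enc, b_enc, k):
    """Map one backbone hidden state to sparse SAE concept activations,
    keeping only the k strongest concepts (TopK SAE)."""
    acts = np.maximum(hidden @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    cutoff = np.sort(acts)[-k]                       # k-th largest activation
    return np.where(acts >= cutoff, acts, 0.0)       # zero everything below it

def splade_over_concepts(token_hiddens, W_enc, b_enc, k):
    """SPLADE-style log-saturated max pooling, taken over SAE concepts
    rather than over the backbone token vocabulary."""
    per_token = np.stack([topk_sae_encode(h, W_enc, b_enc, k)
                          for h in token_hiddens])
    return np.log1p(per_token).max(axis=0)

rng = np.random.default_rng(0)
d, n_concepts, n_tokens, k = 8, 32, 5, 4             # toy dimensions
W_enc = rng.normal(size=(d, n_concepts))
b_enc = rng.normal(size=n_concepts)

query_vec = splade_over_concepts(rng.normal(size=(n_tokens, d)), W_enc, b_enc, k)
doc_vec = splade_over_concepts(rng.normal(size=(n_tokens, d)), W_enc, b_enc, k)
score = float(query_vec @ doc_vec)  # sparse dot product, inverted-index friendly
```

The retrieval machinery downstream (inverted index, dot-product scoring) is unchanged; only the axis of the sparse vector shifts from tokens to concepts.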

If this is right

  • Retrieval performance stays comparable to standard SPLADE on both in-domain and out-of-domain tasks.
  • Efficiency improves relative to the token-based SPLADE model.
  • Training approaches exist that keep the SAE concepts and SPLADE framework compatible.
  • Differences appear in how semantic concepts versus tokens drive sparse representations for retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shift to latent concepts could ease adaptation of sparse retrieval to new languages or modalities that lack strong token vocabularies.
  • Further refinement of the SAE stage might produce even smaller models while preserving ranking quality.
  • The same concept-substitution pattern could be tested on other learned sparse retrieval methods to test its generality.

Load-bearing premise

The latent concepts learned by the SAE capture all retrieval-critical information carried by the original tokens, and the training regimes of the two systems remain compatible.
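
One way to probe that premise empirically (an editorial sketch, not an analysis from the paper): check how closely token-space and concept-space scoring agree on the ordering of the same candidate documents, e.g. via a Spearman rank correlation.

```python
import numpy as np

def rank_agreement(scores_a, scores_b):
    """Spearman rank correlation between two score lists. A value of 1.0
    means the two scorers (e.g. token-space SPLADE vs. concept-space
    SAE-SPLADE) order the candidate documents identically."""
    ra = np.argsort(np.argsort(scores_a)).astype(float)  # rank of each doc
    rb = np.argsort(np.argsort(scores_b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# Hypothetical per-document scores from the two systems (illustrative only)
token_scores = [0.91, 0.40, 0.73, 0.12]
concept_scores = [0.88, 0.35, 0.70, 0.09]
agreement = rank_agreement(token_scores, concept_scores)  # 1.0: same ordering
```

A sustained agreement near 1.0 across queries would support the drop-in-replacement premise; systematic disagreement on rare or polysemous query terms would undermine it.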

What would settle it

A substantial drop in retrieval metrics such as nDCG on out-of-domain benchmarks for SAE-SPLADE compared with SPLADE, or the absence of measurable gains in model size or inference speed.
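
For concreteness, the metric that criterion turns on can be computed as follows. The relevance grades here are toy values, not the paper's data:

```python
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain of the top-k relevance grades."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k: system DCG normalised by the ideal (perfectly sorted) DCG."""
    ideal = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades (0-3) of each system's top-10 results
splade_run = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
sae_splade_run = [3, 0, 2, 1, 0, 2, 0, 0, 0, 0]
delta = ndcg_at_k(sae_splade_run) - ndcg_at_k(splade_run)
# "Comparable" means |delta| stays small across benchmarks, not that it is zero.
```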

Figures

Figures reproduced from arXiv: 2604.21511 by Basile Van Cooten, Benjamin Piwowarski, Laure Soulier, Mathias Vast, Yuxuan Zong.

Figure 1. The architecture of our SAE-SPLADE model. … view at source ↗
Figure 2. SAE-SPLADE performance depending on the value … view at source ↗
Figure 3. Performance of the SAE-SPLADE model (with TopK … view at source ↗
read the original abstract

Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SAE-SPLADE, which replaces the token vocabulary of SPLADE with latent semantic concepts learned via Sparse Auto-Encoders. The authors examine compatibility between the two representations, explore training regimes, analyze differences from standard SPLADE, and claim that SAE-SPLADE delivers retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while improving efficiency.

Significance. If the empirical claims hold, the work would be significant for sparse retrieval by decoupling models from fixed token vocabularies, potentially mitigating polysemy/synonymy issues and enabling easier multi-lingual or multi-modal extensions. The emphasis on compatibility analysis and difference studies could provide reusable insights into concept-based sparse representations.

major comments (2)
  1. [Abstract] The central claim that SAE-SPLADE achieves 'comparable' retrieval performance on in-domain and out-of-domain tasks is stated without quantitative metrics, baselines, statistical tests, error bars, or ablation details. This absence prevents verification of the result and raises the possibility of post-hoc selection or unaccounted variance.
  2. [Abstract (and associated experiments)] The manuscript's core assumption—that SAE-learned latent concepts form a drop-in replacement for the token vocabulary without losing retrieval-critical information—is load-bearing for the out-of-domain claim. SAE reconstruction necessarily introduces approximation error; if rare terms, polysemous distinctions, or domain-specific collocations are systematically under-represented, OOD performance could degrade even when in-domain results appear comparable. The paper must supply targeted analysis or ablations demonstrating that no such critical signal is lost.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., nDCG@10 delta or efficiency gain) to support the comparability claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's focus on strengthening the empirical presentation and robustness analysis. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] The central claim that SAE-SPLADE achieves 'comparable' retrieval performance on in-domain and out-of-domain tasks is stated without quantitative metrics, baselines, statistical tests, error bars, or ablation details. This absence prevents verification of the result and raises the possibility of post-hoc selection or unaccounted variance.

    Authors: We agree that the abstract would be strengthened by including specific quantitative support for the performance claims. In the revised version, we will update the abstract to report key metrics such as nDCG@10 on the MS MARCO development set (in-domain) and the average across BEIR datasets (out-of-domain), with direct comparisons to the original SPLADE model and other sparse baselines. We will also indicate that results are averaged over multiple runs and include standard deviations to address variance. While the abstract's length limits full statistical test details, these will be expanded in the main experimental section. This change directly addresses verifiability and reduces concerns about post-hoc selection. revision: yes

  2. Referee: [Abstract (and associated experiments)] The manuscript's core assumption—that SAE-learned latent concepts form a drop-in replacement for the token vocabulary without losing retrieval-critical information—is load-bearing for the out-of-domain claim. SAE reconstruction necessarily introduces approximation error; if rare terms, polysemous distinctions, or domain-specific collocations are systematically under-represented, OOD performance could degrade even when in-domain results appear comparable. The paper must supply targeted analysis or ablations demonstrating that no such critical signal is lost.

    Authors: This concern about potential loss of retrieval-critical information is well-taken and directly relevant to our OOD claims. The manuscript already contains dedicated sections analyzing compatibility between SAE-derived concepts and the token vocabulary, as well as comparative studies of representation differences. To more rigorously address risks from approximation error on rare terms, polysemy, and domain-specific collocations, we will add targeted ablations in the revision. These will include frequency-stratified reconstruction error analysis, case studies on polysemous term handling, and OOD performance breakdowns on BEIR to verify no systematic degradation. We believe these additions will provide the requested evidence that critical signals are preserved. revision: yes
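
The frequency-stratified reconstruction-error analysis promised in this response could take roughly the following shape (an editorial sketch with a toy tied-weight SAE and random data; every name and number here is hypothetical):

```python
import numpy as np

def recon_error_by_freq_bin(hiddens, term_freqs, encode, decode, n_bins=3):
    """Mean SAE reconstruction error per term-frequency bin, ordered from the
    rarest to the most frequent terms. A rising error toward the rare bin
    would signal exactly the OOD risk the referee raises."""
    errs = np.array([np.linalg.norm(h - decode(encode(h))) for h in hiddens])
    order = np.argsort(term_freqs)                  # rare terms first
    return [float(errs[b].mean()) for b in np.array_split(order, n_bins)]

rng = np.random.default_rng(1)
d, n_latents, n_terms = 6, 24, 30                   # toy dimensions
W = rng.normal(size=(d, n_latents)) / np.sqrt(d)
encode = lambda h: np.maximum(h @ W, 0.0)           # toy tied-weight SAE
decode = lambda a: a @ W.T
hiddens = rng.normal(size=(n_terms, d))
freqs = rng.integers(1, 10_000, size=n_terms)       # hypothetical corpus counts

errors = recon_error_by_freq_bin(hiddens, freqs, encode, decode)
```

Roughly flat errors across bins would support the drop-in-replacement assumption; a markedly worse rare-term bin would localize where OOD degradation should be expected.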

Circularity Check

0 steps flagged

No circularity: empirical model proposal and comparison

full rationale

The paper proposes SAE-SPLADE by replacing SPLADE's token vocabulary with SAE-learned latent concepts, then empirically compares retrieval performance on in-domain and out-of-domain tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The central claim rests on experimental results rather than reducing to its own inputs by construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the general assumption that SAE concepts are semantically richer than tokens.

pith-pipeline@v0.9.0 · 5430 in / 1021 out tokens · 24659 ms · 2026-05-09T20:26:30.758694+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

58 extracted references · 19 canonical work pages · 5 internal anchors

  1. [1]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

  2. [2]

    Nikita Balagansky, Yaroslav Aksenov, Daniil Laptev, Vadim Kurochkin, Gleb Gerasimov, Nikita Koriagin, and Daniil Gavrilov. 2025. Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 10182–10190

  3. [3]

    Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2021. mMARCO: A multilingual version of the MS MARCO passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021)

  4. [4]

    Trenton Bricken, Adria Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas Turner, Cem Anil, Catherine Denison, Amanda Askell, Robert Lasenby, Yuhuai Wu, Samuel Kravec, Nicholas Schiefer, Thomas Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Katherine Nguyen, Brian McLean, Jack E. Burke, Tristan Hume, Shan Carter, Tom Heni...

  5. [5]

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems (NeurIPS)

  6. [6]

    Bart Bussmann, Patrick Leask, and Neel Nanda. 2024. BatchTopK Sparse Autoencoders. In NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning. https://openreview.net/forum?id=d4dpOCqybL

  7. [7]

    Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. 2025. Learning multi-level features with matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547 (2025)

  8. [8]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216 (2024)

  9. [9]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2021. Overview of the TREC 2020 deep learning track. In Text REtrieval Conference (TREC). TREC. https://www.microsoft.com/en-us/research/publication/overview-of-the-trec-2020-deep-learning-track/

  10. [10]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. In Text REtrieval Conference (TREC). TREC. https://www.microsoft.com/en-us/research/publication/overview-of-the-trec-2019-deep-learning-track/

  11. [11]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). ACL, 4171–4186

  12. [12]

    Jeffrey M. Dudek, Weize Kong, Cheng Li, Mingyang Zhang, and Michael Bendersky. 2023. Learning Sparse Lexical Representations Over Specified Vocabularies for Retrieval. In International Conference on Information and Knowledge Management (CIKM)

  13. [13]

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. [n. d.]. Towards Effective and Efficient Sparse Neural Information Retrieval. ([n. d.]), 3634912. https://doi.org/10.1145/3634912

  14. [15]

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv preprint arXiv:2109.10086 (2021)

  15. [16]

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant

  16. [17]

    From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. In ACM Conference on Research and Development in Information Retrieval (SIGIR). ACM, 2353–2359. https://doi.org/10.1145/3477495.3531857

  17. [18]

    Thibault Formal, Maxime Louis, Hervé Déjean, and Stéphane Clinchant. 2026. Learning Retrieval Models with Sparse Autoencoders. In International Conference on Learning Representations (ICLR). https://openreview.net/pdf?id=TuFjICawSc OpenReview preprint

  18. [19]

    Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In ACM Conference on Research and Development in Information Retrieval (SIGIR). ACM, 2288–2292. https://doi.org/10.1145/3404835.3463098

  19. [20]

    Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Association for Computational Linguistics, 3030–3042. https://aclanthology.org/2021.naacl-main.241

  20. [21]

    Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2025. Scaling and evaluating sparse autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=tcsZt9ZNKD

  21. [22]

    Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 315–323

  22. [23]

    Nathan Godey, Éric Clergerie, and Benoît Sagot. 2024. Anisotropy is inherent to self-attention in transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 35–48

  23. [24]

    Kyoung-Rok Jang, Junmo Kang, Giwon Hong, Sung-Hyon Myaeng, Joohee Park, Taewon Yoon, and Heecheol Seo. 2021. Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 1016–1029

  24. [25]

    Hao Kang, Tevin Wang, and Chenyan Xiong. 2025. Interpret and control dense retrieval with sparse latent features. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Langu...

  25. [26]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 6769–6781

  26. [27]

    Hiun Kim, Tae Kwan Lee, and Taeryun Won. 2025. The Role of Vocabularies in Learning Sparse Representations for Ranking. arXiv preprint arXiv:2509.16621 (2025)

  27. [28]

    Carlos Lassance. 2023. Extending English IR methods to multi-lingual IR. arXiv preprint arXiv:2302.14723 (2023)

  28. [29]

    Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New Baselines for SPLADE. (2024)

  29. [30]

    Carlos Lassance, Thibault Formal, and Stéphane Clinchant. 2021. Composite Code Sparse Autoencoders for First Stage Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2136–2140. https://doi.o...

  30. [31]

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 278–300

  31. [32]

    Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder. 2020. Expansion via Prediction of Importance with Contextualization. In ACM Conference on Research and Development in Information Retrieval (SIGIR). ACM, 1573–1576. https://doi.org/10.1145/3397271.3401262

  32. [33]

    Joel Mackenzie, Andrew Trotman, and Jimmy Lin. 2021. Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation. arXiv preprint arXiv:2110.11540 (2021)

  33. [34]

    Antonio Mallia, Torsten Suel, and Nicola Tonellotto. 2024. Faster learned sparse retrieval with block-max pruning. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2411–2415

  34. [35]

    Aashiq Muhamed, Mona Diab, and Virginia Smith. 2025. Decoding dark matter: Specialized sparse autoencoders for interpreting rare concepts in foundation models. In Findings of the Association for Computational Linguistics: NAACL 2025. 1604–1635

  35. [36]

    Andrew Ng et al. 2011. Sparse autoencoder. (2011)

  36. [37]

    Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Iain Mackie, Jeff Dalton, and Andrew Yates. 2024. DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 767–783. https://doi.org/10.18653/v1/2024.emnlp-main.45

  37. [38]

    Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, and Andrew Yates. 2025. Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector. arXiv preprint arXiv:2510.00671 (2025)

  38. [39]

    Thong Nguyen, Sean MacAvaney, and Andrew Yates. 2023. A unified framework for learned sparse retrieval. In European Conference on Information Retrieval. Springer, 101–116

  39. [40]

    Biswajit Paria, Chih-Kuan Yeh, Ian EH Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. 2020. Minimizing flops to learn efficient sparse representations. (2020)

  40. [41]

    Seongwan Park, Taeklim Kim, and Youngjoong Ko. 2025. Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval. arXiv preprint arXiv:2506.00041 (2025)

  41. [42]

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. 2024. Improving dictionary learning with gated sparse autoencoders. arXiv preprint (2024)

  42. [43]

    Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv preprint (2024)

  43. [44]

    Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Text Retrieval Conference. https://api.semanticscholar.org/CorpusID:41563977

  44. [45]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint (2019)

  45. [46]

    Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. PLAID: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 1747–1756

  46. [47]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 3715–3734

  47. [48]

    Gil Shamir and Dong Lin. 2022. Reproducibility in Deep Learning and Smooth Activations. https://research.google/blog/reproducibility-in-deep-learning-and-smooth-activations/. Google Research Blog, April 5, 2022

  48. [49]

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. International Conference on Machine Learning (ICML) (2025)

  49. [50]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Édouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint abs/2302.13971 (2023)

  50. [51]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval. arXiv preprint (2022)

  51. [52]

    Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, and Chenyu You. 2025. Beyond matryoshka: Revisiting sparse coding for adaptive representation. arXiv preprint arXiv:2503.01776 (2025)

  52. [53]

    Zhichao Xu, Shengyao Zhuang, Crystina Zhang, Xueguang Ma, Yijun Tian, Maitrey Mehta, Jimmy Lin, and Vivek Srikumar. 2026. LACONIC: Dense-Level Effectiveness for Scalable Sparse Retrieval via a Two-Phase Training Curriculum. arXiv preprint arXiv:2601.01684 (2026)

  53. [54]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  54. [55]

    Puxuan Yu, Antonio Mallia, and Matthias Petri. 2024. Improved learned sparse retrieval with corpus-specific vocabularies. In European Conference on Information Retrieval. Springer, 181–194

  55. [56]

    Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik G. Learned-Miller, and J. Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018). https://api.semanticscholar.org/CorpusID:52229883

  56. [57]

    Hansi Zeng, Hamed Zamani, and Vishwa Vinay. 2022. Curriculum learning for dense retrieval distillation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1979–1983

  57. [58]

    Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Al- fonso Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin

  58. [59]

    MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages. Transactions of the Association for Computational Linguistics 11 (2023), 1114–1131