pith. machine review for the scientific record.

arXiv: 2603.19339 · v2 · submitted 2026-03-19 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 1 theorem link · Lean Theorem

Spectral Tempering for Embedding Compression in Dense Passage Retrieval

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 08:54 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords dense passage retrieval · dimensionality reduction · spectral scaling · embedding compression · information retrieval · principal component analysis · signal-to-noise ratio

The pith

Spectral Tempering derives an adaptive scaling factor from the embedding spectrum to compress dense retrieval vectors without training or tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the optimal power coefficient for scaling dimensions in embedding compression is not a single fixed value but changes with the target number of dimensions according to the local signal-to-noise profile of the retained subspace. It introduces Spectral Tempering as a procedure that reads this profile directly from the corpus eigenspectrum through local SNR analysis and knee-point normalization, then applies the resulting γ(k) to reweight dimensions. Because the method needs only the unlabeled corpus embeddings, it eliminates grid search over hyperparameters and labeled validation data while reaching accuracy close to the best manually tuned baseline across multiple target sizes.

Core claim

The optimal scaling strength γ for spectral reweighting of retrieval embeddings varies systematically with target dimensionality k and is governed by the signal-to-noise ratio of the retained subspace; Spectral Tempering estimates this strength from local SNR analysis and knee-point normalization performed solely on the corpus eigenspectrum, yielding a learning-free, model-agnostic γ(k) that matches the performance of grid-searched optima.

What carries the argument

Spectral Tempering (SpecTemp), which computes an adaptive γ(k) by local SNR analysis and knee-point normalization on the corpus eigenspectrum to set the scaling strength for each target dimensionality.
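To fix ideas, here is a minimal sketch of the spectral-scaling family this sits in, assuming the parameterization common in the whitening literature the paper builds on: project onto the top-k principal components and reweight component i by λ_i^(-γ/2), so γ = 0 recovers plain PCA and γ = 1 full whitening. The function name and the fixed γ in the usage lines are illustrative scaffolding, not the paper's code; SpecTemp's contribution is choosing γ per target k rather than fixing it.

```python
import numpy as np

def spectral_scale(X, k, gamma):
    """Compress embeddings X (n x d) to k dims: PCA projection, then
    reweight component i by lambda_i ** (-gamma / 2).
    gamma = 0 is plain PCA; gamma = 1 is full whitening."""
    Xc = X - X.mean(axis=0)                 # center the corpus
    lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    idx = np.argsort(lam)[::-1][:k]         # top-k eigenpairs, descending
    weights = lam[idx] ** (-gamma / 2.0)    # interpolate PCA <-> whitening
    return (Xc @ V[:, idx]) * weights       # n x k compressed embeddings

# Illustrative usage: 768-dim vectors down to 128 dims at a fixed gamma.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
Z = spectral_scale(X, k=128, gamma=0.5)
print(Z.shape)  # (1000, 128)
```

Note that at query time the mean, projection, and weights fitted on the corpus must be reused for the query vectors, or dot-product scores between the two sides stop being comparable.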

If this is right

  • Dimensionality reduction for dense retrieval no longer requires per-task hyperparameter search or labeled validation sets.
  • The same procedure applies unchanged to embeddings produced by any model, because it uses only the corpus eigenspectrum.
  • Compressed vectors retain near-oracle retrieval quality for every chosen output dimension.
  • Storage and query latency in production retrieval systems can be reduced while preserving accuracy without additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectrum-driven adaptation could be tested on other embedding tasks such as sentence clustering or recommendation ranking.
  • If the SNR-knee pattern proves stable across domains, the method offers a template for making other post-hoc compression techniques parameter-free.
  • The approach suggests that many high-dimensional embedding spaces share a predictable decay structure that can be exploited without supervision.

Load-bearing premise

The claim that the optimal γ can be recovered accurately from the eigenspectrum alone, via local SNR analysis and knee-point normalization, without any task labels.
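The review does not pin down the exact knee rule or the SNR-to-γ mapping (the referee's minor comment below asks for precisely that), so here is one hedged instantiation for concreteness: a Kneedle-style knee on the normalized eigenvalue curve (the paper cites the Kneedle knee-detection method), plus a hypothetical squash of the retained subspace's local SNR into a γ in (0, 1). Everything past the knee detector is our assumption, not the paper's formula.

```python
import numpy as np

def knee_index(eigvals):
    """Kneedle-style knee point: where the normalized descending
    eigenvalue curve falls furthest below its end-to-end chord."""
    y = np.sort(eigvals)[::-1]
    y = (y - y.min()) / (y.max() - y.min())   # rescale curve into [0, 1]
    x = np.linspace(0.0, 1.0, len(y))
    return int(np.argmax((1.0 - x) - y))      # chord from (0,1) to (1,0)

def gamma_of_k(eigvals, k):
    """Hypothetical gamma(k): mean eigenvalue of the retained top-k
    subspace over the post-knee noise floor, squashed into (0, 1).
    High local SNR tolerates whitening-like scaling (gamma near 1);
    as k grows into the noise floor, gamma decreases toward PCA."""
    lam = np.sort(eigvals)[::-1]
    floor = lam[knee_index(lam):].mean()      # post-knee "noise" level
    snr = lam[:k].mean() / max(floor, 1e-12)  # local SNR of retained dims
    return float(snr / (1.0 + snr))           # monotone squash into (0, 1)
```

The squash here is deliberately simple; the paper's knee-point normalization presumably differs, but any monotone map that lowers γ as the retained subspace's SNR falls reproduces the qualitative γ(k) behavior the core claim describes.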

What would settle it

On a held-out query set, apply the derived γ(k) to compress embeddings and measure retrieval metrics; if accuracy falls below that of a simple fixed-γ baseline or PCA at the same k, the adaptive estimation fails.
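A sketch of that settling test on synthetic stand-in data, reusing spectral-scale compression and the hypothetical gamma_of_k from the sketches above; the documents, queries, relevance labels, and the hit@10 proxy are all placeholders for a real benchmark (e.g., MS MARCO or NQ scored with nDCG@10/MRR).

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(2000, 256))                       # stand-in corpus
queries = docs[:200] + 0.1 * rng.normal(size=(200, 256))  # noisy paraphrases
relevant = np.arange(200)                                 # query i's gold doc is doc i
lam_corpus = np.linalg.eigvalsh(np.cov(docs, rowvar=False))

def fit_transform(docs, queries, k, gamma):
    """Fit the scaling map on the corpus; apply it to docs and queries alike."""
    mu = docs.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(docs, rowvar=False))
    idx = np.argsort(lam)[::-1][:k]
    w = lam[idx] ** (-gamma / 2.0)
    transform = lambda X: ((X - mu) @ V[:, idx]) * w
    return transform(docs), transform(queries)

def hit_at_10(dZ, qZ, relevant):
    """Share of queries whose gold doc lands in the dot-product top 10
    (a crude proxy for nDCG@10 / MRR)."""
    top = np.argsort(-(qZ @ dZ.T), axis=1)[:, :10]
    return float(np.mean([relevant[i] in top[i] for i in range(len(qZ))]))

# The falsification test: the adaptive gamma(k) must not score below PCA
# (gamma = 0) or a simple fixed gamma at the same k.
for k in (32, 64, 128):
    for name, g in [("adaptive", gamma_of_k(lam_corpus, k)),
                    ("fixed 0.5", 0.5), ("PCA", 0.0)]:
        dZ, qZ = fit_transform(docs, queries, k, g)
        print(f"k={k:3d} {name:9s} hit@10={hit_at_10(dZ, qZ, relevant):.3f}")
```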

Figures

Figures reproduced from arXiv: 2603.19339 by Evangelos Kanoulas, Panagiotis Eustratiadis, Yongkang Li.

Figure 1: Consistent spectral structure of dense retrieval embeddings.
Figure 2: Performance consistency on NQ across additional …
Original abstract

Dimensionality reduction is critical for deploying dense retrieval systems at scale, yet mainstream post-hoc methods face a fundamental trade-off: principal component analysis (PCA) preserves dominant variance but underutilizes representational capacity, while whitening enforces isotropy at the cost of amplifying noise in the heavy-tailed eigenspectrum of retrieval embeddings. Intermediate spectral scaling methods unify these extremes by reweighting dimensions with a power coefficient $\gamma$, but treat $\gamma$ as a fixed hyperparameter that requires task-specific tuning. We show that the optimal scaling strength $\gamma$ is not a global constant: it varies systematically with target dimensionality $k$ and is governed by the signal-to-noise ratio (SNR) of the retained subspace. Based on this insight, we propose Spectral Tempering (SpecTemp), a learning-free method that derives an adaptive $\gamma(k)$ directly from the corpus eigenspectrum using local SNR analysis and knee-point normalization, requiring no labeled data or validation-based search. Extensive experiments demonstrate that Spectral Tempering consistently achieves near-oracle performance relative to grid-searched $\gamma^*(k)$ while remaining fully learning-free and model-agnostic. Our code is publicly available at https://github.com/liyongkang123/SpecTemp.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Spectral Tempering (SpecTemp), a learning-free method for compressing dense retrieval embeddings. It claims that the optimal spectral scaling exponent γ is not fixed but varies systematically with target dimensionality k, and can be derived directly from the corpus covariance eigenspectrum via local signal-to-noise ratio analysis followed by knee-point normalization, yielding retrieval performance (nDCG/MRR) close to that of an oracle γ* obtained by grid search on labeled validation data, without any supervised tuning or model-specific training.

Significance. If the central claim holds, SpecTemp would remove a practical bottleneck in deploying dense retrievers at scale by eliminating validation-based hyperparameter search while still outperforming standard PCA and whitening baselines. The fully unsupervised, model-agnostic character and public code release are clear strengths that could influence production embedding pipelines.

major comments (2)
  1. [§3.2] Derivation of γ(k): the mapping from the local SNR knee-point on the corpus eigenspectrum to the scaling strength γ(k) is introduced as a heuristic, without a derivation or proof that the resulting γ preserves query-document similarity rankings under the dot-product or cosine metrics used at inference; this step is load-bearing for the 'near-oracle' guarantee.
  2. [§4] Experimental validation: the abstract asserts 'near-oracle' performance, yet the reported results must include concrete deltas (e.g., nDCG@10 or MRR differences versus grid-searched γ* and versus PCA/whitening) on standard benchmarks such as MS MARCO and Natural Questions, together with ablations on the knee-detection rule and sensitivity to finite-sample eigenspectrum estimation.
minor comments (2)
  1. [Abstract] The phrase 'extensive experiments demonstrate' should be accompanied by at least one headline quantitative result (e.g., 'within 0.5% of oracle nDCG@10 on MS MARCO') so readers can gauge the strength of the claim immediately.
  2. [§3.1] Notation: the precise definition of 'local SNR' and the knee-point detection algorithm (e.g., which curvature or slope threshold is used) should be stated as an explicit equation or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] Derivation of γ(k): the mapping from the local SNR knee-point on the corpus eigenspectrum to the scaling strength γ(k) is introduced as a heuristic, without a derivation or proof that the resulting γ preserves query-document similarity rankings under the dot-product or cosine metrics used at inference; this step is load-bearing for the 'near-oracle' guarantee.

    Authors: We acknowledge that the mapping from the local SNR knee-point to γ(k) is presented as a heuristic rather than a formally derived quantity. The motivation stems from the observation that the knee identifies the transition from signal-dominated to noise-dominated dimensions, after which a tempered scaling (γ(k) < 1) prevents noise amplification while preserving relative similarities under dot-product and cosine metrics (a worked example of this amplification effect follows after this exchange). Although we do not supply a closed-form proof that this choice exactly preserves rankings, the method is grounded in the eigenspectrum properties of retrieval embeddings and is validated by consistently achieving near-oracle performance across benchmarks. In the revised manuscript we will expand §3.2 with additional intuition and a small-scale analytic example illustrating why the knee-normalized γ maintains the ordering of query-document scores. revision: partial

  2. Referee: [§4] Experimental validation: the abstract asserts 'near-oracle' performance, yet the reported results must include concrete deltas (e.g., nDCG@10 or MRR differences versus grid-searched γ* and versus PCA/whitening) on standard benchmarks such as MS MARCO and Natural Questions, together with ablations on the knee-detection rule and sensitivity to finite-sample eigenspectrum estimation.

    Authors: We agree that explicit numerical deltas and additional ablations will make the experimental claims more precise. The current version reports that SpecTemp is close to the oracle but does not tabulate exact differences. In the revision we will add tables in §4 showing nDCG@10 and MRR deltas versus both the grid-searched γ* and the PCA/whitening baselines on MS MARCO and Natural Questions. We will also include ablations on alternative knee-detection procedures (e.g., curvature-based vs. threshold-based) and sensitivity experiments that subsample the corpus to assess finite-sample eigenspectrum stability (a sketch of such a check follows below). revision: yes
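Two quick sketches of points raised in this exchange, both ours rather than the authors'. First, the noise-amplification intuition from response 1 as a two-eigenvalue worked example: under λ^(-γ/2) reweighting, a near-noise direction with eigenvalue 0.01 is stretched 10x more than a signal direction with eigenvalue 1.0 at full whitening, but only about 3.2x at a tempered γ = 0.5.

```python
import numpy as np

lam = np.array([1.0, 0.01])        # eigenvalues: signal dim, near-noise dim
for gamma in (0.0, 0.5, 1.0):      # PCA, tempered, full whitening
    w = lam ** (-gamma / 2.0)
    print(f"gamma={gamma}: noise dim weighted {w[1] / w[0]:.1f}x the signal dim")
# gamma=0.0 -> 1.0x, gamma=0.5 -> 3.2x, gamma=1.0 -> 10.0x: whitening lets
# spurious variation along the noisy direction dominate dot-product scores.
```

Second, the finite-sample sensitivity check promised in response 2 is easy to prototype: re-estimate the eigenspectrum on random corpus subsamples and track how far the derived γ(k) drifts, here via the hypothetical gamma_of_k sketched earlier on this page.

```python
import numpy as np

def gamma_stability(docs, k, fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Re-derive gamma(k) from eigenspectra of random corpus subsamples;
    a finite-sample-stable estimator should barely move across fractions."""
    rng = np.random.default_rng(seed)
    out = {}
    for frac in fractions:
        idx = rng.choice(len(docs), size=int(frac * len(docs)), replace=False)
        lam = np.linalg.eigvalsh(np.cov(docs[idx], rowvar=False))
        out[frac] = gamma_of_k(lam, k)  # hypothetical heuristic from above
    return out

# e.g. gamma_stability(docs, k=128) -> gamma estimates keyed by subsample fraction
```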

standing simulated objections (not resolved)
  • A formal derivation or proof that the SNR knee-point heuristic exactly preserves query-document similarity rankings under dot-product or cosine metrics is not available; the claim rests on empirical evidence.

Circularity Check

0 steps flagged

No significant circularity: adaptive γ(k) derived directly from unlabeled corpus eigenspectrum

full rationale

The paper's central derivation computes γ(k) via local SNR analysis and knee-point normalization on the corpus covariance eigenspectrum alone, using only unlabeled data. This procedure does not reduce by construction to any fitted parameter, task label, or self-referential definition inside the paper; the output γ(k) is produced from spectral statistics without presupposing retrieval performance. No self-citations are load-bearing for the uniqueness or correctness of the mapping, and the method is explicitly learning-free. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard linear-algebraic spectral decomposition of the embedding matrix and a domain assumption that the eigenspectrum encodes usable signal-to-noise information; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption: the eigenspectrum of the corpus embeddings reflects the signal-to-noise ratio across subspaces.
    Invoked to justify deriving the adaptive scaling strength γ(k) from local SNR analysis (see the synthetic check below).
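The ledger's one axiom matches the spiked-covariance picture from random matrix theory. A quick synthetic check we constructed (not from the paper) shows why the assumption is plausible: plant a rank-5 signal in isotropic noise and the eigenspectrum separates into a few large spikes over a flat noise floor, which is exactly the structure local SNR analysis needs to read.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5000, 64, 5                          # corpus size, dims, signal rank
signal = (rng.normal(size=(n, r)) @ rng.normal(size=(r, d))) * 0.5
noise = rng.normal(size=(n, d)) * 0.2          # isotropic, variance 0.04
lam = np.sort(np.linalg.eigvalsh(np.cov(signal + noise, rowvar=False)))[::-1]
print(lam[:r].round(2))        # r large "spiked" eigenvalues carry the signal
print(lam[r:r + 5].round(2))   # then a flat floor near the 0.04 noise variance
```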

pith-pipeline@v0.9.0 · 5527 in / 1331 out tokens · 61706 ms · 2026-05-15T08:54:33.444169+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
