pith. machine review for the scientific record.

arXiv: 2603.19339 · v2 · submitted 2026-03-19 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 1 theorem link · Lean Theorem

Spectral Tempering for Embedding Compression in Dense Passage Retrieval

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 08:54 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords dense passage retrieval · dimensionality reduction · spectral scaling · embedding compression · information retrieval · principal component analysis · signal-to-noise ratio

The pith

Spectral Tempering derives an adaptive scaling factor from the embedding spectrum to compress dense retrieval vectors without training or tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the optimal power coefficient for scaling dimensions in embedding compression is not a single fixed value but changes with the target number of dimensions according to the local signal-to-noise profile of the retained subspace. It introduces Spectral Tempering as a procedure that reads this profile directly from the corpus eigenspectrum through local SNR analysis and knee-point normalization, then applies the resulting γ(k) to reweight dimensions. Because the method needs only the unlabeled corpus embeddings, it eliminates grid search over hyperparameters and labeled validation data while reaching accuracy close to the best manually tuned baseline across multiple target sizes.

Core claim

The optimal scaling strength γ for spectral reweighting of retrieval embeddings varies systematically with target dimensionality k and is governed by the signal-to-noise ratio of the retained subspace; Spectral Tempering estimates this strength from local SNR analysis and knee-point normalization performed solely on the corpus eigenspectrum, yielding a learning-free, model-agnostic γ(k) that matches the performance of grid-searched optima.

What carries the argument

Spectral Tempering (SpecTemp), which computes an adaptive γ(k) by local SNR analysis and knee-point normalization on the corpus eigenspectrum to set the scaling strength for each target dimensionality.
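To fix ideas, here is a minimal sketch of the spectral-scaling family this sits in, assuming the parameterization common in the whitening literature the paper builds on: project onto the top-k principal components and reweight component i by λ_i^(-γ/2), so γ = 0 recovers plain PCA and γ = 1 full whitening. The function name and the fixed γ in the usage lines are illustrative scaffolding, not the paper's code; SpecTemp's contribution is choosing γ per target k rather than fixing it.

```python
import numpy as np

def spectral_scale(X, k, gamma):
    """Compress embeddings X (n x d) to k dims: PCA projection, then
    reweight component i by lambda_i ** (-gamma / 2).
    gamma = 0 is plain PCA; gamma = 1 is full whitening."""
    Xc = X - X.mean(axis=0)                 # center the corpus
    lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    idx = np.argsort(lam)[::-1][:k]         # top-k eigenpairs, descending
    weights = lam[idx] ** (-gamma / 2.0)    # interpolate PCA <-> whitening
    return (Xc @ V[:, idx]) * weights       # n x k compressed embeddings

# Illustrative usage: 768-dim vectors down to 128 dims at a fixed gamma.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
Z = spectral_scale(X, k=128, gamma=0.5)
print(Z.shape)  # (1000, 128)
```

Note that at query time the mean, projection, and weights fitted on the corpus must be reused for the query vectors, or dot-product scores between the two sides stop being comparable.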

If this is right

  • Dimensionality reduction for dense retrieval no longer requires per-task hyperparameter search or labeled validation sets.
  • The same procedure applies unchanged to embeddings produced by any model, because it uses only the corpus eigenspectrum.
  • Compressed vectors retain near-oracle retrieval quality for every chosen output dimension.
  • Storage and query latency in production retrieval systems can be reduced while preserving accuracy without additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectrum-driven adaptation could be tested on other embedding tasks such as sentence clustering or recommendation ranking.
  • If the SNR-knee pattern proves stable across domains, the method offers a template for making other post-hoc compression techniques parameter-free.
  • The approach suggests that many high-dimensional embedding spaces share a predictable decay structure that can be exploited without supervision.

Load-bearing premise

The claim that the optimal γ can be recovered accurately from the eigenspectrum alone, via local SNR analysis and knee-point normalization, without any task labels.
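The review does not pin down the exact knee rule or the SNR-to-γ mapping (the referee's minor comment below asks for precisely that), so here is one hedged instantiation for concreteness: a Kneedle-style knee on the normalized eigenvalue curve (the paper cites the Kneedle knee-detection method), plus a hypothetical squash of the retained subspace's local SNR into a γ in (0, 1). Everything past the knee detector is our assumption, not the paper's formula.

```python
import numpy as np

def knee_index(eigvals):
    """Kneedle-style knee point: where the normalized descending
    eigenvalue curve falls furthest below its end-to-end chord."""
    y = np.sort(eigvals)[::-1]
    y = (y - y.min()) / (y.max() - y.min())   # rescale curve into [0, 1]
    x = np.linspace(0.0, 1.0, len(y))
    return int(np.argmax((1.0 - x) - y))      # chord from (0,1) to (1,0)

def gamma_of_k(eigvals, k):
    """Hypothetical gamma(k): mean eigenvalue of the retained top-k
    subspace over the post-knee noise floor, squashed into (0, 1).
    High local SNR tolerates whitening-like scaling (gamma near 1);
    as k grows into the noise floor, gamma decreases toward PCA."""
    lam = np.sort(eigvals)[::-1]
    floor = lam[knee_index(lam):].mean()      # post-knee "noise" level
    snr = lam[:k].mean() / max(floor, 1e-12)  # local SNR of retained dims
    return float(snr / (1.0 + snr))           # monotone squash into (0, 1)
```

The squash here is deliberately simple; the paper's knee-point normalization presumably differs, but any monotone map that lowers γ as the retained subspace's SNR falls reproduces the qualitative γ(k) behavior the core claim describes.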

What would settle it

On a held-out query set, apply the derived γ(k) to compress embeddings and measure retrieval metrics; if accuracy falls below that of a simple fixed-γ baseline or PCA at the same k, the adaptive estimation fails.
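A sketch of that settling test on synthetic stand-in data, reusing spectral-scale compression and the hypothetical gamma_of_k from the sketches above; the documents, queries, relevance labels, and the hit@10 proxy are all placeholders for a real benchmark (e.g., MS MARCO or NQ scored with nDCG@10/MRR).

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(2000, 256))                       # stand-in corpus
queries = docs[:200] + 0.1 * rng.normal(size=(200, 256))  # noisy paraphrases
relevant = np.arange(200)                                 # query i's gold doc is doc i
lam_corpus = np.linalg.eigvalsh(np.cov(docs, rowvar=False))

def fit_transform(docs, queries, k, gamma):
    """Fit the scaling map on the corpus; apply it to docs and queries alike."""
    mu = docs.mean(axis=0)
    lam, V = np.linalg.eigh(np.cov(docs, rowvar=False))
    idx = np.argsort(lam)[::-1][:k]
    w = lam[idx] ** (-gamma / 2.0)
    transform = lambda X: ((X - mu) @ V[:, idx]) * w
    return transform(docs), transform(queries)

def hit_at_10(dZ, qZ, relevant):
    """Share of queries whose gold doc lands in the dot-product top 10
    (a crude proxy for nDCG@10 / MRR)."""
    top = np.argsort(-(qZ @ dZ.T), axis=1)[:, :10]
    return float(np.mean([relevant[i] in top[i] for i in range(len(qZ))]))

# The falsification test: the adaptive gamma(k) must not score below PCA
# (gamma = 0) or a simple fixed gamma at the same k.
for k in (32, 64, 128):
    for name, g in [("adaptive", gamma_of_k(lam_corpus, k)),
                    ("fixed 0.5", 0.5), ("PCA", 0.0)]:
        dZ, qZ = fit_transform(docs, queries, k, g)
        print(f"k={k:3d} {name:9s} hit@10={hit_at_10(dZ, qZ, relevant):.3f}")
```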

Figures

Figures reproduced from arXiv: 2603.19339 by Evangelos Kanoulas, Panagiotis Eustratiadis, Yongkang Li.

Figure 1: Consistent spectral structure of dense retrieval embeddings.
Figure 2: Performance consistency on NQ across additional …
Original abstract

Dimensionality reduction is critical for deploying dense retrieval systems at scale, yet mainstream post-hoc methods face a fundamental trade-off: principal component analysis (PCA) preserves dominant variance but underutilizes representational capacity, while whitening enforces isotropy at the cost of amplifying noise in the heavy-tailed eigenspectrum of retrieval embeddings. Intermediate spectral scaling methods unify these extremes by reweighting dimensions with a power coefficient $\gamma$, but treat $\gamma$ as a fixed hyperparameter that requires task-specific tuning. We show that the optimal scaling strength $\gamma$ is not a global constant: it varies systematically with target dimensionality $k$ and is governed by the signal-to-noise ratio (SNR) of the retained subspace. Based on this insight, we propose Spectral Tempering (SpecTemp), a learning-free method that derives an adaptive $\gamma(k)$ directly from the corpus eigenspectrum using local SNR analysis and knee-point normalization, requiring no labeled data or validation-based search. Extensive experiments demonstrate that Spectral Tempering consistently achieves near-oracle performance relative to grid-searched $\gamma^*(k)$ while remaining fully learning-free and model-agnostic. Our code is publicly available at https://github.com/liyongkang123/SpecTemp.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Spectral Tempering (SpecTemp), a learning-free method for compressing dense retrieval embeddings. It claims that the optimal spectral scaling exponent γ is not fixed but varies systematically with target dimensionality k, and can be derived directly from the corpus covariance eigenspectrum via local signal-to-noise ratio analysis followed by knee-point normalization, yielding retrieval performance (nDCG/MRR) close to that of an oracle γ* obtained by grid search on labeled validation data, without any supervised tuning or model-specific training.

Significance. If the central claim holds, SpecTemp would remove a practical bottleneck in deploying dense retrievers at scale by eliminating validation-based hyperparameter search while still outperforming standard PCA and whitening baselines. The fully unsupervised, model-agnostic character and public code release are clear strengths that could influence production embedding pipelines.

major comments (2)
  1. [§3.2] Derivation of γ(k): the mapping from the local SNR knee-point on the corpus eigenspectrum to the scaling strength γ(k) is introduced as a heuristic, without a derivation or proof that the resulting γ preserves query-document similarity rankings under the dot-product or cosine metrics used at inference; this step is load-bearing for the 'near-oracle' guarantee.
  2. [§4] Experimental validation: the abstract asserts 'near-oracle' performance, yet the reported results must include concrete deltas (e.g., nDCG@10 or MRR differences versus grid-searched γ* and versus PCA/whitening) on standard benchmarks such as MS MARCO and Natural Questions, together with ablations on the knee-detection rule and sensitivity to finite-sample eigenspectrum estimation.
minor comments (2)
  1. [Abstract] The phrase 'extensive experiments demonstrate' should be accompanied by at least one headline quantitative result (e.g., 'within 0.5% of oracle nDCG@10 on MS MARCO') so readers can gauge the strength of the claim immediately.
  2. [§3.1] Notation: the precise definition of 'local SNR' and the knee-point detection algorithm (e.g., which curvature or slope threshold is used) should be stated as an explicit equation or pseudocode for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] Derivation of γ(k): the mapping from the local SNR knee-point on the corpus eigenspectrum to the scaling strength γ(k) is introduced as a heuristic, without a derivation or proof that the resulting γ preserves query-document similarity rankings under the dot-product or cosine metrics used at inference; this step is load-bearing for the 'near-oracle' guarantee.

    Authors: We acknowledge that the mapping from the local SNR knee-point to γ(k) is presented as a heuristic rather than a formally derived quantity. The motivation stems from the observation that the knee identifies the transition from signal-dominated to noise-dominated dimensions, after which a tempered scaling (γ(k) < 1) prevents noise amplification while preserving relative similarities under dot-product and cosine metrics (a worked example of this amplification effect follows after this exchange). Although we do not supply a closed-form proof that this choice exactly preserves rankings, the method is grounded in the eigenspectrum properties of retrieval embeddings and is validated by consistently achieving near-oracle performance across benchmarks. In the revised manuscript we will expand §3.2 with additional intuition and a small-scale analytic example illustrating why the knee-normalized γ maintains the ordering of query-document scores. revision: partial

  2. Referee: [§4] Experimental validation: the abstract asserts 'near-oracle' performance, yet the reported results must include concrete deltas (e.g., nDCG@10 or MRR differences versus grid-searched γ* and versus PCA/whitening) on standard benchmarks such as MS MARCO and Natural Questions, together with ablations on the knee-detection rule and sensitivity to finite-sample eigenspectrum estimation.

    Authors: We agree that explicit numerical deltas and additional ablations will make the experimental claims more precise. The current version reports that SpecTemp is close to the oracle but does not tabulate exact differences. In the revision we will add tables in §4 showing nDCG@10 and MRR deltas versus both the grid-searched γ* and the PCA/whitening baselines on MS MARCO and Natural Questions. We will also include ablations on alternative knee-detection procedures (e.g., curvature-based vs. threshold-based) and sensitivity experiments that subsample the corpus to assess finite-sample eigenspectrum stability (a sketch of such a check follows below). revision: yes
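Two quick sketches of points raised in this exchange, both ours rather than the authors'. First, the noise-amplification intuition from response 1 as a two-eigenvalue worked example: under λ^(-γ/2) reweighting, a near-noise direction with eigenvalue 0.01 is stretched 10x more than a signal direction with eigenvalue 1.0 at full whitening, but only about 3.2x at a tempered γ = 0.5.

```python
import numpy as np

lam = np.array([1.0, 0.01])        # eigenvalues: signal dim, near-noise dim
for gamma in (0.0, 0.5, 1.0):      # PCA, tempered, full whitening
    w = lam ** (-gamma / 2.0)
    print(f"gamma={gamma}: noise dim weighted {w[1] / w[0]:.1f}x the signal dim")
# gamma=0.0 -> 1.0x, gamma=0.5 -> 3.2x, gamma=1.0 -> 10.0x: whitening lets
# spurious variation along the noisy direction dominate dot-product scores.
```

Second, the finite-sample sensitivity check promised in response 2 is easy to prototype: re-estimate the eigenspectrum on random corpus subsamples and track how far the derived γ(k) drifts, here via the hypothetical gamma_of_k sketched earlier on this page.

```python
import numpy as np

def gamma_stability(docs, k, fractions=(0.1, 0.25, 0.5, 1.0), seed=0):
    """Re-derive gamma(k) from eigenspectra of random corpus subsamples;
    a finite-sample-stable estimator should barely move across fractions."""
    rng = np.random.default_rng(seed)
    out = {}
    for frac in fractions:
        idx = rng.choice(len(docs), size=int(frac * len(docs)), replace=False)
        lam = np.linalg.eigvalsh(np.cov(docs[idx], rowvar=False))
        out[frac] = gamma_of_k(lam, k)  # hypothetical heuristic from above
    return out

# e.g. gamma_stability(docs, k=128) -> gamma estimates keyed by subsample fraction
```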

standing simulated objections (not resolved)
  • A formal derivation or proof that the SNR knee-point heuristic exactly preserves query-document similarity rankings under dot-product or cosine metrics is not available; the claim rests on empirical evidence.

Circularity Check

0 steps flagged

No significant circularity: adaptive γ(k) derived directly from unlabeled corpus eigenspectrum

full rationale

The paper's central derivation computes γ(k) via local SNR analysis and knee-point normalization on the corpus covariance eigenspectrum alone, using only unlabeled data. This procedure does not reduce by construction to any fitted parameter, task label, or self-referential definition inside the paper; the output γ(k) is produced from spectral statistics without presupposing retrieval performance. No self-citations are load-bearing for the uniqueness or correctness of the mapping, and the method is explicitly learning-free. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard linear-algebraic spectral decomposition of the embedding matrix and a domain assumption that the eigenspectrum encodes usable signal-to-noise information; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption: the eigenspectrum of the corpus embeddings reflects the signal-to-noise ratio across subspaces.
    Invoked to justify deriving the adaptive scaling strength γ(k) from local SNR analysis (see the synthetic check below).
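The ledger's one axiom matches the spiked-covariance picture from random matrix theory. A quick synthetic check we constructed (not from the paper) shows why the assumption is plausible: plant a rank-5 signal in isotropic noise and the eigenspectrum separates into a few large spikes over a flat noise floor, which is exactly the structure local SNR analysis needs to read.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5000, 64, 5                          # corpus size, dims, signal rank
signal = (rng.normal(size=(n, r)) @ rng.normal(size=(r, d))) * 0.5
noise = rng.normal(size=(n, d)) * 0.2          # isotropic, variance 0.04
lam = np.sort(np.linalg.eigvalsh(np.cov(signal + noise, rowvar=False)))[::-1]
print(lam[:r].round(2))        # r large "spiked" eigenvalues carry the signal
print(lam[r:r + 5].round(2))   # then a flat floor near the 0.04 noise variance
```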

pith-pipeline@v0.9.0 · 5527 in / 1331 out tokens · 61706 ms · 2026-05-15T08:54:33.444169+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
