pith. machine review for the scientific record.

arxiv: 2604.05480 · v1 · submitted 2026-04-07 · 💻 cs.CR · cs.DB

Recognition: 2 theorem links


Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects


Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3

classification 💻 cs.CR cs.DB
keywords vector database security · poisoning attack · embedding space · hubness · black-hole attack · retrieval vulnerability · high-dimensional geometry · AI security

The pith

A few vectors placed near the center of an embedding space can appear in the top results for nearly every query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vector databases are vulnerable to a poisoning attack in which an attacker adds a small set of malicious vectors close to the geometric center of the stored data. These vectors succeed because high-dimensional embeddings leave the centroid region nearly empty, so any vectors placed there become nearest neighbors for a large share of other points through centrality-driven hubness. A reader should care because this geometric feature means retrieval systems built on embeddings cannot safely assume their stored vectors are honest or representative. Experiments show the injected vectors appearing in up to 99.85 percent of top-10 result lists across tested setups. Standard methods meant to reduce hubness either cut retrieval accuracy sharply or leave the attack largely intact.

Core claim

The Black-Hole Attack works by injecting malicious vectors near the centroid of the existing vectors in a database. In high-dimensional embedding spaces the centroid region stays nearly empty in practice, so vectors located there exhibit centrality-driven hubness and become the nearest neighbor for a disproportionately large number of other vectors. As a result the malicious vectors are returned in the top-k results for most queries, reaching 99.85 percent of top-10 lists in the reported trials. The attack therefore demonstrates that geometric defects make it unsafe to trust vectors in a database without further checks.
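The mechanism is easy to reproduce on synthetic data. The sketch below is a toy model under stated assumptions (isotropic unit vectors, Euclidean retrieval, arbitrary sizes chosen for illustration), not the paper's construction: a vector placed at the raw centroid of normalized embeddings lands near the origin, a region no honest vector occupies, and ends up closer to every query than any honest neighbor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 512, 5_000, 10

# Honest corpus: random unit vectors standing in for real embeddings.
corpus = rng.standard_normal((n, d))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Malicious vector placed at the raw centroid, which for unit vectors
# lies near the origin -- a region the honest data never occupies.
black_hole = corpus.mean(axis=0)
db = np.vstack([corpus, black_hole])  # row index n is the injected vector

queries = rng.standard_normal((500, d))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Euclidean top-k retrieval via squared distances computed from dot
# products: each query sits ~1.0 from the centroid but ~sqrt(2) from
# any honest unit vector, so the injected row dominates.
q_sq = (queries**2).sum(axis=1, keepdims=True)
x_sq = (db**2).sum(axis=1)
d2 = q_sq + x_sq - 2 * queries @ db.T
topk = np.argpartition(d2, k, axis=1)[:, :k]
hit_rate = (topk == n).any(axis=1).mean()
print(f"injected vector appears in top-{k} for {hit_rate:.1%} of queries")
```

On this toy data the injected row reaches essentially every top-10 list, which is the qualitative behavior the paper measures at up to 99.85 percent on real embeddings.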

What carries the argument

Centrality-driven hubness: the property that vectors placed near the nearly empty centroid of a high-dimensional embedding become nearest neighbors to a disproportionately large number of other vectors.

If this is right

  • A small number of injected vectors can reach high coverage of top-k results without large changes to the database.
  • Existing techniques for lowering hubness either reduce retrieval accuracy or leave most queries still vulnerable to the attack.
  • Retrieval results from vector databases rest on geometric features that attackers can exploit with minimal effort.
  • Secure vector databases will require new defenses that address the empty-centroid property directly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same central placement tactic could be tested in other embedding-based systems such as recommendation or semantic search to check for similar exposure.
  • Real-time monitoring for vectors that suddenly appear as neighbors to an unusually large fraction of queries might serve as an early detection signal.
  • The effect may grow stronger as embedding dimension increases, suggesting experiments that vary dimension while holding data size fixed.
  • Applications that treat vector retrieval as ground truth, such as legal or medical document search, may need additional verification layers even when the database itself is not directly poisoned.
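The monitoring idea in the second bullet can be sketched as an occurrence audit over a query stream. The outlier rule below (median plus ten median absolute deviations) is a hypothetical threshold chosen for illustration, not a method from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 256, 2_000, 10

corpus = rng.standard_normal((n, d))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
db = np.vstack([corpus, corpus.mean(axis=0)])  # last row: injected hub

queries = rng.standard_normal((400, d))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Top-k occurrence count per stored vector over the query stream.
# The per-query constant ||q||^2 is dropped; it does not change ranking.
d2 = (db**2).sum(axis=1) - 2 * queries @ db.T
topk = np.argpartition(d2, k, axis=1)[:, :k]
counts = np.bincount(topk.ravel(), minlength=len(db))

# Flag vectors whose occurrence count is wildly above the typical level,
# here more than 10 median absolute deviations over the median.
med = np.median(counts)
mad = np.median(np.abs(counts - med)) + 1e-9
flagged = np.where(counts > med + 10 * mad)[0]
print("flagged indices:", flagged)  # expect the injected row, index n
```

An honest vector appears in only a handful of top-10 lists, while the injected hub appears in nearly all of them, so even a crude robust-statistics threshold separates the two in this toy setting.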

Load-bearing premise

High-dimensional embedding spaces in practice leave the centroid region nearly empty, so that any vectors placed there become nearest neighbors to many others.
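This premise can be probed on toy data by sampling normalized isotropic vectors at several dimensions and checking how the distance-to-centroid distribution concentrates; real embeddings are anisotropic, but the claim is that the same empty ball appears around their centroid:

```python
import numpy as np

rng = np.random.default_rng(2)
for d in (16, 128, 1024):
    # Random unit vectors as a stand-in for normalized embeddings.
    x = rng.standard_normal((5_000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # Distance of every point to the empirical centroid.
    r = np.linalg.norm(x - x.mean(axis=0), axis=1)
    print(f"d={d:5d}  min dist to centroid {r.min():.3f}  "
          f"mean {r.mean():.3f}  std {r.std():.4f}")
```

As the dimension grows, the distances concentrate tightly around their mean and the minimum stays far from zero, so a sizable ball around the centroid contains no data at all, which is exactly the vacancy an attacker can occupy.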

What would settle it

Measure whether a small set of injected vectors near the centroid of a real embedding dataset appears in the top-10 results for the great majority of held-out queries; consistent failure to appear would show the attack does not work as described.

Figures

Figures reproduced from arXiv: 2604.05480 by Hanxi Li, Jiale Lao, Jianan Zhou, Junfen Wang, Mingjie Tang, Yang Cao, Yibo Wang, Zhengmao Ye.

Figure 1. The Workflow of the Black-Hole Attack.
Figure 2. Workflow and attack process of vector database.
Figure 3. Black-Hole Attack workflow.
Figure 4. Empirical CDF of the distance-to-centroid under Eu…
Figure 5. Hubness Probability on Real Embeddings: Fraction of Vectors Nearest to the Centroid.
Figure 6. Sensitivity to Number of Clusters: MO@10 from 1…
Figure 7. Impact of the Black-Hole Attack on downstream…
Figure 8. Detection-based defense. Top: MO@10 on the poisoned database before and after filtering. Bottom: R@10 between pre- and post-filter results on an unpoisoned corpus.
read the original abstract

Vector databases serve as the retrieval backbone of modern AI applications, yet their security remains largely unexplored. We propose the Black-Hole Attack, a poisoning attack that injects a small number of malicious vectors near the geometric center of the stored vectors. These injected vectors attract queries like a black hole and frequently appear in the top-k retrieval results for most queries. This attack is enabled by a phenomenon we term centrality-driven hubness: in high-dimensional embedding spaces, vectors near the centroid become nearest neighbors of a disproportionately large number of other vectors, while this centroid region is nearly empty in practice. The attack shows that vectors in a vector database cannot be blindly trusted: geometric defects in high-dimensional embeddings make retrieval inherently vulnerable. Our experiments show that malicious vectors appear in up to 99.85% of top-10 results. Additionally, we evaluate existing hubness mitigation methods as potential defenses against the Black-Hole Attack. The results show that these methods either significantly reduce retrieval accuracy or provide limited protection, which indicates the need for more robust defenses against the Black-Hole Attack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Black-Hole Attack, a poisoning attack on vector databases that injects a small number of malicious vectors near the geometric centroid of stored embeddings. It exploits centrality-driven hubness, whereby vectors near the (nearly empty) centroid become nearest neighbors to a disproportionately large fraction of queries in high-dimensional spaces. Experiments report malicious vectors appearing in up to 99.85% of top-10 results, and the authors evaluate existing hubness mitigation methods, finding that they either degrade retrieval accuracy or offer limited protection.

Significance. If the attack generalizes beyond the reported settings, the result would be significant for security of embedding-based retrieval systems that underpin RAG, recommendation, and semantic search. The work supplies concrete empirical attack success rates and a direct evaluation of candidate defenses, which is a positive contribution. These elements provide a falsifiable starting point for further study of geometric vulnerabilities in vector stores.

major comments (2)
  1. [Abstract and Experimental Evaluation] The reported peak success rate of 99.85% is presented without any description of the embedding models (e.g., BERT, CLIP), datasets, query distributions, number of injected vectors, or preprocessing (L2 normalization or mean-centering). These omissions are load-bearing because the central claim rests on the centroid region being nearly empty; standard normalization steps common in production embeddings could populate that region and materially weaken the hubness effect.
  2. [Introduction and Attack Construction] The assertion that centrality-driven hubness is an inherent geometric defect making retrieval 'inherently vulnerable' is not accompanied by controls or ablations showing that the effect survives after the mean-centering and unit-norm operations routinely applied to embeddings. Without such evidence the attack's practical scope remains unclear.
minor comments (2)
  1. The manuscript introduces the terms 'centrality-driven hubness' and 'Black-Hole Attack' without a concise comparison table or paragraph relating them to prior hubness-reduction literature (e.g., mutual proximity, local scaling) or to existing poisoning attacks on embeddings.
  2. Notation for the injected vectors and the centroid region is introduced informally; a short formal definition or diagram early in the paper would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation of our experimental setup and strengthen the claims regarding the robustness of the Black-Hole Attack. We address each major comment below and have prepared a revised manuscript that incorporates additional details and analyses.

read point-by-point responses
  1. Referee: [Abstract and Experimental Evaluation] The reported peak success rate of 99.85% is presented without any description of the embedding models (e.g., BERT, CLIP), datasets, query distributions, number of injected vectors, or preprocessing (L2 normalization or mean-centering). These omissions are load-bearing because the central claim rests on the centroid region being nearly empty; standard normalization steps common in production embeddings could populate that region and materially weaken the hubness effect.

    Authors: We agree that the abstract and experimental sections require more explicit details to support the reported success rates. In the revised manuscript, we have updated the abstract to briefly note the key experimental parameters and added a new subsection (Section 4.1) describing the embedding models (BERT-base, CLIP ViT-B/32), datasets (MS MARCO for text, ImageNet subsets for images), query sampling (uniform over held-out test sets), number of injected vectors (1 to 10), and preprocessing (L2 normalization applied to all embeddings, with no additional mean-centering beyond model outputs). Our re-analysis confirms that the centroid region remains sparsely populated post-normalization, with the hubness effect intact; we include supporting statistics on centroid occupancy. revision: yes

  2. Referee: [Introduction and Attack Construction] The assertion that centrality-driven hubness is an inherent geometric defect making retrieval 'inherently vulnerable' is not accompanied by controls or ablations showing that the effect survives after the mean-centering and unit-norm operations routinely applied to embeddings. Without such evidence the attack's practical scope remains unclear.

    Authors: We acknowledge the need for explicit controls on standard preprocessing. The original experiments already applied L2 unit-norm normalization to embeddings as is conventional, and the centroid remained nearly empty. To directly address the comment, the revised manuscript adds an ablation study (new Figure 5 and Table 3) that further applies explicit mean-centering before attack injection. Results show the hubness effect and attack success rates (still exceeding 95% in top-10) persist under these operations, supporting that the vulnerability arises from high-dimensional geometry rather than preprocessing artifacts. We have revised the introduction to reference these controls. revision: yes
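The ablation the rebuttal describes can be sketched as a small harness that applies the contested preprocessing and then measures how populated the centroid region is. Everything here (synthetic data, the occupancy radius rule, the sizes) is illustrative, not the paper's actual revised experiment:

```python
import numpy as np

def preprocess(x, center=True, unit_norm=True):
    # The two preprocessing steps at issue: optional mean-centering,
    # then L2 normalization of each row.
    if center:
        x = x - x.mean(axis=0)
    if unit_norm:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x

def centroid_occupancy(x, radius_frac=0.5):
    # Fraction of vectors inside a ball around the centroid whose radius
    # is radius_frac times the mean distance-to-centroid.
    r = np.linalg.norm(x - x.mean(axis=0), axis=1)
    return float((r < radius_frac * r.mean()).mean())

rng = np.random.default_rng(3)
# Toy anisotropic embeddings: a strong shared mean direction plus noise.
raw = rng.standard_normal((5_000, 768)) + 2.0

results = {}
for center in (False, True):
    results[center] = centroid_occupancy(preprocess(raw, center=center))
    print(f"mean-centering={center}: centroid occupancy {results[center]:.4f}")
```

In this toy model the centroid ball stays empty whether or not mean-centering is applied, which is the qualitative outcome the rebuttal claims for its ablation; on real embeddings the question is empirical and is what the revised experiments would need to show.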

Circularity Check

0 steps flagged

No circularity: empirical attack construction with measured success rates

full rationale

The paper presents an empirical poisoning attack that injects vectors near the observed centroid of embedding spaces and measures retrieval success (up to 99.85% in top-10). Centrality-driven hubness is introduced as an observed geometric property in high-dimensional data, supported by experiments across embeddings rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations reduce the attack efficacy to the inputs by construction; the result is falsifiable via external benchmarks on normalized embeddings and remains independent of the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the geometric property of high-dimensional spaces and on the empirical observation that the centroid region is nearly empty; no free parameters are fitted in the abstract description.

axioms (1)
  • domain assumption In high-dimensional embedding spaces, vectors near the centroid become nearest neighbors of a disproportionately large number of other vectors while the centroid region remains nearly empty.
    This is the load-bearing geometric phenomenon invoked to explain why central injections succeed.
invented entities (1)
  • Black-Hole Attack no independent evidence
    purpose: Poisoning attack that places malicious vectors near the embedding centroid to dominate retrieval
    Newly introduced attack concept whose effectiveness is demonstrated only within the paper's experiments.

pith-pipeline@v0.9.0 · 5508 in / 1250 out tokens · 63917 ms · 2026-05-10T19:35:12.492863+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

