arxiv: 2412.14113 · v4 · pith:HJOJJ35Knew · submitted 2024-12-18 · 💻 cs.CR · cs.IR

Adversarial Hubness in Multi-Modal Retrieval

Pith reviewed 2026-05-23 06:36 UTC · model grok-4.3

classification 💻 cs.CR cs.IR
keywords adversarial hubness · multi-modal retrieval · hubness phenomenon · adversarial attacks · vector databases · text-to-image retrieval · image retrieval

The pith

Attackers can turn any image into an adversarial hub retrieved as top-1 for over 21,000 out of 25,000 queries in multi-modal systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that hubness, the known tendency of some points in high-dimensional spaces to lie close to many others, can be deliberately amplified by an attacker. An optimization process crafts a single input whose embedding is unusually close to a wide range of query embeddings, so the retrieval system returns it as the most relevant result for thousands of unrelated queries. This enables both broad injection of unwanted content and targeted attacks on chosen concepts. Experiments on standard caption-to-image benchmarks and a commercial vector database confirm the effect, with one crafted hub dominating far more queries than any natural hub. According to the abstract, techniques that mitigate natural hubness are not effective against hubs that target queries related to specific concepts.

Core claim

A method exists for generating adversarial hubs from arbitrary images or audio that are retrieved as the top match for a large fraction of queries in multi-modal retrieval, with one such hub serving as top-1 for more than 21,000 of 25,000 test queries compared to only 102 for the strongest natural hub.
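
The headline numbers are counts of how often one gallery item is the top-1 result. A minimal sketch of that count, assuming cosine similarity as the ranking score and using hypothetical 2-D embeddings (the vectors, `cosine`, and `top1_counts` are illustrative names, not taken from the paper):

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_counts(gallery, queries):
    """Count, for each gallery index, how many queries rank it first."""
    counts = Counter()
    for q in queries:
        best = max(range(len(gallery)), key=lambda i: cosine(q, gallery[i]))
        counts[best] += 1
    return counts


# Hypothetical embeddings: items 0 and 1 are ordinary, item 2 sits between them.
gallery = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
queries = [(2.0, 1.0), (1.0, 2.0), (1.0, 1.0), (1.0, 0.05)]

counts = top1_counts(gallery, queries)
hub_index, hub_hits = counts.most_common(1)[0]
hub_share = hub_hits / len(queries)
print(hub_index, hub_hits, hub_share)
```

On the same scale, more than 21,000 of 25,000 queries is a share above 84%, while 102 of 25,000 is about 0.4%.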

What carries the argument

An optimization procedure that crafts an input to increase its proximity to many random or targeted query embeddings in the shared multi-modal space.
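
This summary does not spell out the optimization, so the following is only a sketch of one plausible form: projected sign-gradient ascent that perturbs an input within a small budget to raise its mean cosine similarity to a sample of query embeddings. The linear encoder `W`, the L-infinity budget `eps`, the step size, and all function names are assumptions made for illustration; a real attack would differentiate through the actual embedding model.

```python
import math
import random


def embed(W, x):
    """Toy linear encoder e = W x (an assumption, standing in for a real model)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def unit(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mean_cosine(e, q_mean):
    """Mean cosine similarity of e to unit-norm queries whose mean is q_mean."""
    norm = math.sqrt(sum(v * v for v in e))
    return sum(a * b for a, b in zip(q_mean, e)) / norm


def grad_wrt_input(W, x, q_mean):
    """Analytic gradient of mean_cosine(embed(W, x), q_mean) with respect to x."""
    e = embed(W, x)
    norm = math.sqrt(sum(v * v for v in e))
    dot = sum(a * b for a, b in zip(q_mean, e))
    # Gradient of (q_mean . e) / |e| with respect to e.
    g_e = [q / norm - dot * v / norm ** 3 for q, v in zip(q_mean, e)]
    # Chain rule through the linear encoder: multiply by W transposed.
    return [sum(W[r][c] * g_e[r] for r in range(len(W))) for c in range(len(x))]


def craft_hub(W, x0, queries, eps=0.1, step=0.01, iters=200):
    """Projected sign-gradient ascent inside an L-infinity ball of radius eps."""
    q_mean = mean_vector(queries)
    x = list(x0)
    best_x, best_score = list(x0), mean_cosine(embed(W, x0), q_mean)
    for _ in range(iters):
        g = grad_wrt_input(W, x, q_mean)
        x = [xi + step * math.copysign(1.0, gi) for xi, gi in zip(x, g)]
        # Project back so every coordinate stays within eps of the original.
        x = [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, x0)]
        score = mean_cosine(embed(W, x), q_mean)
        if score > best_score:  # keep the best iterate seen so far
            best_x, best_score = list(x), score
    return best_x, best_score


# Hypothetical setup: random encoder weights and 100 random unit-norm "queries".
rng = random.Random(0)
dim_in, dim_emb = 16, 8
W = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_emb)]
queries = [unit([rng.gauss(0, 1) for _ in range(dim_emb)]) for _ in range(100)]
x0 = [rng.random() for _ in range(dim_in)]

before = mean_cosine(embed(W, x0), mean_vector(queries))
hub, after = craft_hub(W, x0, queries)
print(f"mean cosine before: {before:.4f}, after: {after:.4f}")
```

The sketch needs gradients of the encoder, which matches the white-box access the simulated rebuttal concedes; it says nothing about black-box settings.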

If this is right

  • Universal spam or malicious content can be injected so that it appears in response to thousands of unrelated user queries.
  • Targeted attacks become feasible by steering hubs toward queries related to attacker-chosen concepts.
  • Existing methods for reducing natural hubness are not effective against adversarial hubs that target queries related to specific concepts.
  • The attack applies to both benchmark datasets and production vector databases such as Pinecone.
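
This summary does not name the mitigations the paper evaluates. As an illustration of what a natural-hubness correction looks like, here is a sketch of CSLS (cross-domain similarity local scaling, from the word-translation work in the reference list), which subtracts each item's average similarity to its nearest neighbours on the other side. The choice of CSLS, the value `k=1`, and the toy vectors are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def csls_scores(queries, gallery, k=1):
    """CSLS(q, g) = 2*cos(q, g) - r_G(q) - r_Q(g), where each r term is the
    mean cosine similarity to the k nearest neighbours on the other side."""
    sim = [[cosine(q, g) for g in gallery] for q in queries]
    r_q = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    cols = [[sim[i][j] for i in range(len(queries))] for j in range(len(gallery))]
    r_g = [sum(sorted(col, reverse=True)[:k]) / k for col in cols]
    return [[2 * sim[i][j] - r_q[i] - r_g[j] for j in range(len(gallery))]
            for i in range(len(queries))]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda j: row[j])


# Hypothetical data: gallery item 2 lies between the two ordinary items,
# and both queries (at 30 and 60 degrees) are closest to it.
gallery = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
queries = [(math.cos(math.radians(30)), math.sin(math.radians(30))),
           (math.cos(math.radians(60)), math.sin(math.radians(60)))]

raw_top1 = [argmax([cosine(q, g) for g in gallery]) for q in queries]
scores = csls_scores(queries, gallery, k=1)
csls_top1 = [argmax(row) for row in scores]
print(raw_top1, csls_top1)
```

In this toy configuration item 2 stays the top-1 result under both raw cosine and CSLS, because it is genuinely the closest item to both queries. That is only an illustration that such rescaling is not guaranteed to demote an item, not a reproduction of the paper's mitigation experiments.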

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retrieval systems may need new defenses that detect or penalize artificially created hubs rather than relying on natural-distribution assumptions.
  • Similar hub-creation attacks could extend to other high-dimensional retrieval tasks beyond multi-modal embeddings.
  • Service providers using vector search should test whether their current models already contain or can be forced to contain such hubs.

Load-bearing premise

The optimization used to create adversarial hubs works across different embedding models and retrieval systems without needing white-box access to the target.

What would settle it

Running the same hub-generation procedure on a multi-modal embedding model different from those tested and measuring whether the resulting hub still ranks in the top position for thousands of queries.

Figures

Figures reproduced from arXiv: 2412.14113 by Collin Zhang, Fnu Suya, Rishi Jha, Tingwei Zhang, Vitaly Shmatikov.

Figure 1: A cross-modal adversarial hub.
Figure 2: An adversarial hub in text-to-image retrieval.
Figure 3: An adversarial hub in image-to-image retrieval.
Figure 5: Threat model. Diagram labels: add perturbation, upload to gallery, encode, search the vector DB, retrieve top-k documents.
Figure 6: Natural hubs and adversarial hubs.
Figure 7: Retrieval frequency of adversarial and natural hubs. Adversarial hubs (red dots) are retrieved significantly more frequently than natural hubs. Plot width indicates retrieval frequency (log scale), with results averaged over 100 trials.
Figure 8: Performance vs. sample size. Attack success rate (ASR) and cosine similarity between adversarial hub and query embeddings improve with larger target sample sizes, converging at around 100 samples.

Original abstract

Hubness is a phenomenon in high-dimensional vector spaces where a point from the natural distribution is unusually close to many other points. This is a well-known problem in information retrieval that causes some items to accidentally (and incorrectly) appear relevant to many queries. In this paper, we investigate how attackers can exploit hubness to turn any image or audio input in a multi-modal retrieval system into an adversarial hub. Adversarial hubs can be used to inject universal adversarial content (e.g., spam) that will be retrieved in response to thousands of different queries, and also for targeted attacks on queries related to specific, attacker-chosen concepts. We present a method for creating adversarial hubs and evaluate the resulting hubs on benchmark multi-modal retrieval datasets and an image-to-image retrieval system implemented by Pinecone, a popular vector database. For example, in text-caption-to-image retrieval, a single adversarial hub, generated using 100 random queries, is retrieved as the top-1 most relevant image for more than 21,000 out of 25,000 test queries (by contrast, the most common natural hub is the top-1 response to only 102 queries), demonstrating the strong generalization capabilities of adversarial hubs. We also investigate whether techniques for mitigating natural hubness can also mitigate adversarial hubs, and show that they are not effective against hubs that target queries related to specific concepts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that attackers can exploit hubness in multi-modal embedding spaces to create 'adversarial hubs': optimized inputs that become universal top-1 retrieval results for thousands of queries. Using an optimization procedure on 100 random queries, a single adversarial hub is reported as top-1 for >21,000 of 25,000 held-out test queries in text-to-image retrieval (vs. 102 for the strongest natural hub). The work evaluates this on benchmark datasets and a Pinecone vector database, shows targeted attacks on concept-related queries, and finds that standard natural-hubness mitigations are not effective against adversarial hubs that target specific concepts.

Significance. If the central empirical result holds under the stated threat model, the work identifies a new, high-impact attack surface in multi-modal retrieval: a single crafted item can dominate retrieval for the vast majority of queries, enabling scalable spam injection or concept-targeted poisoning. The reported generalization (100 optimization queries → 21k+ test queries) is quantitatively striking and would constitute a qualitatively stronger universal attack than typical per-query adversarial examples. Credit is due for the concrete Pinecone evaluation and the comparison against natural hub baselines.

major comments (3)
  1. [Method] Method (optimization procedure): the described procedure optimizes directly against the multi-modal embedding function, implying white-box or gradient access. This directly contradicts the claim of applicability to black-box deployed systems such as Pinecone; no surrogate-model transfer, query-only optimization, or API-only attack is described or evaluated.
  2. [Experiments] Experiments (Pinecone evaluation): the reported success on Pinecone is presented as evidence of real-world applicability, yet the manuscript provides no details on how the adversarial hub was generated or inserted without direct embedding access. If the numbers were obtained with white-box access to the underlying model, they do not support the black-box attack claim.
  3. [Experiments] Evaluation (held-out test queries): the 21,000/25,000 top-1 figure is load-bearing for the generalization claim, yet it is unclear whether the 100 optimization queries were drawn from the same distribution as the test set or whether any data leakage occurred; this must be clarified with explicit train/test splits and query provenance.
minor comments (2)
  1. [Abstract] Abstract: the claim that adversarial hubs work for both image and audio inputs is stated but the quantitative results focus exclusively on text-to-image; a brief statement of audio results (or their absence) would improve clarity.
  2. [Introduction] Notation: the term 'adversarial hub' is introduced without a formal definition distinguishing it from a standard adversarial example; a short definitional paragraph early in the paper would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report, and for recognizing the quantitative strength of the generalization result and the Pinecone evaluation. We address each major comment below and will revise the manuscript accordingly to clarify the threat model and experimental details.

  1. Referee: [Method] Method (optimization procedure): the described procedure optimizes directly against the multi-modal embedding function, implying white-box or gradient access. This directly contradicts the claim of applicability to black-box deployed systems such as Pinecone; no surrogate-model transfer, query-only optimization, or API-only attack is described or evaluated.

    Authors: We agree that the optimization procedure requires direct (white-box) access to the embedding function and its gradients. The manuscript does not describe or evaluate any black-box, query-only, or surrogate-based method. We will revise the text to explicitly state the white-box threat model for hub generation and to remove any implication of black-box applicability to systems such as Pinecone. revision: yes

  2. Referee: [Experiments] Experiments (Pinecone evaluation): the reported success on Pinecone is presented as evidence of real-world applicability, yet the manuscript provides no details on how the adversarial hub was generated or inserted without direct embedding access. If the numbers were obtained with white-box access to the underlying model, they do not support the black-box attack claim.

    Authors: The Pinecone results were obtained by first generating the adversarial hub with white-box access to the underlying embedding model and then inserting the resulting vector into the Pinecone index. We will add a dedicated paragraph in the revised manuscript that (i) states the white-box assumption for generation and (ii) clarifies that the Pinecone experiment demonstrates the retrieval impact once an adversarial item has been inserted into a production vector database, rather than claiming a black-box generation procedure. revision: yes

  3. Referee: [Experiments] Evaluation (held-out test queries): the 21,000/25,000 top-1 figure is load-bearing for the generalization claim, yet it is unclear whether the 100 optimization queries were drawn from the same distribution as the test set or whether any data leakage occurred; this must be clarified with explicit train/test splits and query provenance.

    Authors: The 100 optimization queries were sampled from the training split of the underlying dataset and are disjoint from the 25,000 queries in the standard held-out test split. We will add an explicit statement of the train/test split provenance and confirm the absence of overlap in the revised experimental section. revision: yes
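
The disjointness the authors describe can be checked mechanically. A minimal sketch, assuming queries carry stable identifiers; the id formats, split sizes, and function names below are hypothetical and do not come from the paper or any dataset:

```python
import random


def sample_optimization_queries(train_ids, n, seed=0):
    """Draw n distinct query ids from the training split only."""
    rng = random.Random(seed)
    return rng.sample(sorted(train_ids), n)


def leaked_ids(optimization_ids, test_ids):
    """Return ids present in both sets; an empty set means no overlap."""
    return set(optimization_ids) & set(test_ids)


# Hypothetical id ranges standing in for a real dataset's splits.
train_ids = {f"train-{i}" for i in range(1000)}
test_ids = {f"test-{i}" for i in range(25000)}

opt_ids = sample_optimization_queries(train_ids, 100)
leak = leaked_ids(opt_ids, test_ids)
print(len(opt_ids), len(leak))
```

An id-level check like this would not catch identical caption text that appears under different ids in both splits; comparing normalized query text would be a stricter test.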

Circularity Check

0 steps flagged

No circularity: empirical evaluation on held-out queries

The paper's central results consist of direct empirical measurements: an optimization procedure produces a hub that is then counted as top-1 for >21k/25k held-out test queries, contrasted with natural-hub baselines. No equations or claims reduce a derived quantity to its own fitted inputs by construction, no load-bearing self-citations are invoked to establish uniqueness or correctness, and the method is presented as a standard optimization evaluated on external benchmarks and Pinecone. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; central claim rests on existence of an unspecified optimization procedure that produces hubs with the stated generalization property and on the assumption that benchmark datasets and Pinecone are representative of real systems.

axioms (1)
  • domain assumption: Hubness is a phenomenon in high-dimensional vector spaces where a point from the natural distribution is unusually close to many other points.
    Explicitly stated as a well-known problem in the abstract.
invented entities (1)
  • adversarial hub (no independent evidence)
    purpose: Crafted input that is retrieved for many queries
    Core new concept introduced to describe the attack output; no independent evidence outside the paper is mentioned.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    A single hub text can unreasonably match many images in CLIP-based similarity, exposing vulnerabilities in cross-modal encoders for caption evaluation and retrieval.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Square Attack: a query-efficient black-box adversarial attack via random search

    Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square Attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision (ECCV), 2020

  2. [2]

    Introducing the next generation of Claude

    Anthropic. Introducing the next generation of Claude. https://www.anthropic.com/news/claude-3-family, March 2024. Accessed: August 24, 2025

  3. [3]

    Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples

    Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML), pages 274–283. PMLR, 2018

  4. [4]

    Matching words and pictures

    Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M Blei, and Michael I Jordan. Matching words and pictures. The Journal of Machine Learning Research, 3:1107–1135, 2003

  5. [5]

    Cross modal retrieval with Querybank normalisation

    Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, and Samuel Albanie. Cross modal retrieval with Querybank normalisation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  6. [6]

    Adversarial Patch

    Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv:1712.09665, 2017

  7. [7]

    Poisoning web-scale training datasets is practical

    Nicholas Carlini, Matthew Jagielski, Christopher A Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. Poisoning web-scale training datasets is practical. In IEEE Symposium on Security and Privacy (S&P), pages 407–425, 2024

  8. [8]

    Are aligned neural networks adversarially aligned?

    Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? arXiv:2306.15447, 2023

  9. [9]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (S&P), pages 39–57, 2017

  10. [10]

    Audio adversarial examples: Targeted attacks on speech-to-text

    Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In IEEE Symposium on Security and Privacy Workshops, 2018

  11. [11]

    Phantom: General trigger attacks on retrieval augmented language generation

    Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation. arXiv:2405.20485, 2024

  12. [12]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015

  13. [13]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  14. [14]

    Nearest neighbor normalization improves multimodal retrieval

    Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, and Tristan Thrush. Nearest neighbor normalization improves multimodal retrieval. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  15. [15]

    Embeddings APIs overview

    Google Cloud. Embeddings APIs overview. https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings, 2025. Accessed: 2025-04-09

  16. [16]

    Certified adversarial robustness via randomized smoothing

    Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), pages 1310–1320. PMLR, 2019

  17. [17]

    Word translation without parallel data

    Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In International Conference on Learning Representations (ICLR), 2018

  18. [18]

    Improving zero-shot learning by mitigating the hubness problem

    Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. Improving zero-shot learning by mitigating the hubness problem. arXiv:1412.6568, 2014

  19. [19]

    How robust is Google’s Bard to adversarial image attacks?

    Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, and Jun Zhu. How robust is Google’s Bard to adversarial image attacks? arXiv:2309.11751, 2023

  20. [20]

    Adversarial attacks to multi-modal models

    Zhihao Dou, Xin Hu, Haibo Yang, Zhuqing Liu, and Minghong Fang. Adversarial attacks to multi-modal models. arXiv:2409.06793, 2024

  21. [21]

    Hubness as a case of technical algorithmic bias in music recommendation

    Arthur Flexer, Monika Dörfler, Jan Schlüter, and Thomas Grill. Hubness as a case of technical algorithmic bias in music recommendation. In IEEE International Conference on Data Mining Workshops (ICDMW), 2018

  22. [22]

    Adversarial robustness for visual grounding of multimodal large language models

    Kuofeng Gao, Yang Bai, Jiawang Bai, Yong Yang, and Shu-Tao Xia. Adversarial robustness for visual grounding of multimodal large language models. In ICLR Workshop on Reliable and Responsible Foundation Models, 2024

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  24. [24]

    ImageBind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  25. [25]

    Explaining and harnessing adversarial examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015

  26. [26]

    On the effectiveness of interval bound propagation for training verifiably robust models

    Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy A Mann, and Pushmeet Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv:1810.12715, 2018

  27. [27]

    Countering Adversarial Images using Input Transformations

    C Guo, M Rana, M Cisse, and L Van Der Maaten. Countering adversarial images using input transformations. arXiv:1711.00117, 2021

  28. [28]

    AudioCLIP: Extending CLIP to image, text and audio

    Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. AudioCLIP: Extending CLIP to image, text and audio. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  29. [29]

    Localized centering: Reducing hubness in large-sample data

    Kazuo Hara, Ikumi Suzuki, Masashi Shimbo, Kei Kobayashi, Kenji Fukumizu, and Miloš Radovanović. Localized centering: Reducing hubness in large-sample data. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2015

  30. [30]

    Adversarial example defense: Ensembles of weak defenses are not strong

    Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defense: Ensembles of weak defenses are not strong. In 11th USENIX workshop on offensive technologies (WOOT), 2017

  31. [31]

    Defending a music recommender against hubness-based adversarial attacks

    Katharina Hoedt, Arthur Flexer, and Gerhard Widmer. Defending a music recommender against hubness-based adversarial attacks. In Proceedings of the 19th Sound and Music Computing Conference (SMC), 2022

  32. [32]

    Deceiving Google's Perspective API Built for Detecting Toxic Comments

    Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. Deceiving Google’s Perspective API built for detecting toxic comments. arXiv:1702.08138, 2017

  33. [33]

    Subpopulation data poisoning attacks

    Matthew Jagielski, Giorgio Severi, Niklas Pousette Harger, and Alina Oprea. Subpopulation data poisoning attacks. In ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 3104–3122, 2021

  34. [34]

    A contextual dissimilarity measure for accurate and efficient image search

    Herve Jegou, Hedi Harzallah, and Cordelia Schmid. A contextual dissimilarity measure for accurate and efficient image search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007

  35. [35]

    Automatic image annotation and retrieval using cross-media relevance models

    Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2003

  36. [36]

    Adversarial examples for evaluating reading comprehension systems

    Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017

  37. [37]

    La cryptographie militaire

    Auguste Kerckhoffs. La cryptographie militaire. BoD – Books on Demand, 2023

  38. [38]

    AudioCaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. AudioCaps: Generating captions for audios in the wild. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2019

  39. [39]

    Adversarial self-supervised contrastive learning

    Minseon Kim, Jihoon Tack, and Sung Ju Hwang. Adversarial self-supervised contrastive learning. In Annual Conference on Neural Information Processing Systems (NeurIPS), 2020

  40. [40]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  41. [41]

    Multi-targeted adversarial example in evasion attack on deep neural network

    Hyun Kwon, Yongchul Kim, Ki-Woong Park, Hyunsoo Yoon, and Daeseon Choi. Multi-targeted adversarial example in evasion attack on deep neural network. IEEE Access, 6:46084–46096, 2018

  42. [42]

    Hubness and pollution: Delving into cross-space mapping for zero-shot learning

    Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015

  43. [43]

    HAL: Improved text-image matching by mitigating visual semantic hubs

    Fangyu Liu, Rongtian Ye, Xun Wang, and Shuaipeng Li. HAL: Improved text-image matching by mitigating visual semantic hubs. In AAAI Conference on Artificial Intelligence (AAAI), 2020

  44. [44]

    Delving into transferable adversarial examples and black-box attacks

    Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In International Conference on Learning Representations (ICLR), 2017

  45. [45]

    Feature distillation: DNN-oriented JPEG compression against adversarial examples

    Zihao Liu, Qi Liu, Tao Liu, Nuo Xu, Xue Lin, Yanzhi Wang, and Wujie Wen. Feature distillation: DNN-oriented JPEG compression against adversarial examples. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  46. [46]

    The hubness phenomenon: Fact or artifact?

    Thomas Low, Christian Borgelt, Sebastian Stober, and Andreas Nürnberger. The hubness phenomenon: Fact or artifact? Towards Advanced Data Analysis by Combining Soft Computing and Statistics, pages 267–278, 2013

  47. [47]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004

  48. [48]

    Towards deep learning models resistant to adversarial attacks

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018

  49. [49]

    Universal adversarial perturbations

    Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

  50. [50]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018

  51. [51]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. https://openai.com/research/gpt-4, 2023. Accessed: 2025-04-09

  52. [52]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, May 2024. Accessed: August 24, 2025

  53. [53]

    SpeechGuard: Exploring the adversarial robustness of multimodal large language models

    Raghuveer Peri, Sai Muralidhar Jayanthi, Srikanth Ronanki, Anshu Bhatia, Karel Mundnich, Saket Dingliwal, Nilaksh Das, Zejiang Hou, Goeric Huybrechts, Srikanth Vishnubhotla, et al. SpeechGuard: Exploring the adversarial robustness of multimodal large language models. arXiv:2405.08317, 2024

  54. [54]

    Pinecone - vector database for machine learning

    Pinecone. Pinecone - vector database for machine learning. https://www.pinecone.io, 2024. Accessed: 2024-11-12

  55. [55]

    Visual adversarial examples jailbreak aligned large language models

    Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models. In International Conference on Machine Learning (ICML) Workshop on New Frontiers in Adversarial Machine Learning, 2023

  56. [56]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021

  57. [57]

    Hubs in space: Popular nearest neighbors in high-dimensional data

    Milos Radovanovic, Alexandros Nanopoulos, and Mirjana Ivanovic. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research (JMLR), 11(9):2487–2531, 2010

  58. [58]

    Term-weighting approaches in automatic text retrieval

    Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988

  59. [59]

    Learning cross-modal embeddings for cooking recipes and food images

    Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  60. [60]

    Local and global scaling reduce hubs in space

    Dominik Schnitzer, Arthur Flexer, Markus Schedl, and Gerhard Widmer. Local and global scaling reduce hubs in space. Journal of Machine Learning Research (JMLR), 13:2871–2902, 2012

  61. [61]

    FaceNet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

  62. [62]

    Poison frogs! Targeted clean-label poisoning attacks on neural networks

    Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems (NIPS), 2018

  63. [63]

    Machine against the RAG: Jamming retrieval-augmented generation with blocker documents

    Avital Shafran, Roei Schuster, and Vitaly Shmatikov. Machine against the RAG: Jamming retrieval-augmented generation with blocker documents. In USENIX Security Symposium, 2024

  64. [64]

    Plug and Pray: Exploiting off-the-shelf components of multi-modal models

    Erfan Shayegani, Yue Dong, and Nael Abu-Ghazaleh. Plug and Pray: Exploiting off-the-shelf components of multi-modal models. arXiv:2307.14539, 2023

  65. [65]

    JPEG-resistant adversarial images

    Richard Shin and Dawn Song. JPEG-resistant adversarial images. In NIPS 2017 Workshop on Machine Learning and Computer Security, 2017

  66. [66]

    Offline bilingual word vectors, orthogonal transformations and the inverted softmax

    Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In International Conference on Learning Representations (ICLR), 2017

  67. [67]

    Artificial streaming

    Spotify for Artists. Artificial streaming, 2023. Accessed: 2024-11-13

  68. [68]

    Hybrid batch attacks: Finding black-box adversarial examples with limited queries

    Fnu Suya, Jianfeng Chi, David Evans, and Yuan Tian. Hybrid batch attacks: Finding black-box adversarial examples with limited queries. In USENIX Security Symposium, pages 1327–1344, 2020

  69. [69]

    Model-targeted poisoning attacks with provable convergence

    Fnu Suya, Saeed Mahloujifar, Anshuman Suri, David Evans, and Yuan Tian. Model-targeted poisoning attacks with provable convergence. In International Conference on Machine Learning (ICML), pages 10000–10010. PMLR, 2021

  70. [70]

    Investigating the effectiveness of Laplacian-based kernels in hub reduction

    Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, Yuji Matsumoto, and Marco Saerens. Investigating the effectiveness of Laplacian-based kernels in hub reduction. In AAAI Conference on Artificial Intelligence (AAAI), 2012

  71. [71]

    Centering similarity measures to reduce hubs

    Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, Marco Saerens, and Kenji Fukumizu. Centering similarity measures to reduce hubs. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013

  72. [72]

    Intriguing properties of neural networks

    C Szegedy. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014

  73. [73]

    Adversarial training and robustness for multiple perturbations

    Florian Tramer and Dan Boneh. Adversarial training and robustness for multiple perturbations. In Advances in Neural Information Processing Systems (NIPS), 2019

  74. [74]

    Caltech-UCSD Birds-200-2011 dataset

    Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. Caltech-UCSD Birds-200-2011 dataset. 2011

  75. [75]

    Adversarial cross-modal retrieval

    Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In 25th ACM International Conference on Multimedia (ACM MM), 2017

  76. [76]

    A Comprehensive Survey on Cross-modal Retrieval

    Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. A comprehensive survey on cross-modal retrieval. arXiv:1607.06215, 2016

  77. [77]

    Cross-modal retrieval: a systematic review of methods and future directions

    Tianshi Wang, Fengling Li, Lei Zhu, Jingjing Li, Zheng Zhang, and Heng Tao Shen. Cross-modal retrieval: a systematic review of methods and future directions. Proceedings of the IEEE, 112(11):1716–1754, 2024

  78. [78]

    Balance Act: Mitigating hubness in cross-modal retrieval with query and gallery banks

    Yimu Wang, Xiangru Jian, and Bo Xue. Balance Act: Mitigating hubness in cross-modal retrieval with query and gallery banks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  79. [79]

    Provable defenses against adversarial examples via the convex outer adversarial polytope

    Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML), pages 5286–5295. PMLR, 2018

  80. [80]

    Adversarial attacks on multimodal agents

    Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. Adversarial attacks on multimodal agents. arXiv:2406.12814, 2024

Showing first 80 references.