"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval
Pith reviewed 2026-05-21 14:13 UTC · model grok-4.3
The pith
Surrogate models let attackers craft query-free injections that shift LLM retrieval rankings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish a theoretical framework for LLM-based retrieval and use it to formulate transferable adversarial injection as a min-max problem. We solve the problem with an adversarial learning procedure that optimizes injection tokens on zero-shot surrogate models while treating queries as learnable variables. The resulting tokens alter document rankings on benchmark datasets across multiple LLM retrievers without any knowledge of the victim query or model.
What carries the argument
A min-max formulation of the transferable attack, solved by adversarial learning over surrogate models and learnable query samples.
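For orientation, one generic way to write a min-max objective of this kind is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $\delta$ denotes the injected tokens, $q$ a learnable query from a candidate set $\mathcal{Q}$, $f_s$ one surrogate encoder from a set $\mathcal{S}$, $d \oplus \delta$ the document with the tokens injected, and $\mathcal{L}$ a loss that is low when the injected document is ranked the way the attacker wants.

```latex
\min_{\delta} \; \max_{q \in \mathcal{Q}} \; \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \mathcal{L}\bigl(f_s(q),\, f_s(d \oplus \delta)\bigr)
```

Under this reading, the inner maximization stands in for the unknown victim query: the tokens are tuned against the queries for which they currently work worst. The paper's actual objective may differ in its loss, constraints, and how surrogates are combined.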
If this is right
- The attack succeeds without any query being supplied to the attacker.
- The same tokens affect retrieval performance across different LLM retrievers.
- Ordinary document edits could produce similar unintended ranking shifts.
- Defenses for retrieval systems must address query-independent threats.
Where Pith is reading between the lines
- Document pipelines may need automated checks for token patterns that mimic these injections.
- Robustness benchmarks for retrievers should include query-agnostic test cases.
- Natural wording variations in real documents could be tested to see if they trigger comparable retrieval biases.
Load-bearing premise
Injection tokens found on surrogate models will transfer to unknown victim retrievers even when the attacker has no query information at all.
What would settle it
Apply the generated tokens to documents and measure whether retrieval rank changes occur on held-out LLM retrievers when the attack procedure is given no query whatsoever.
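A minimal sketch of how such a rank-change measurement might look, assuming retrieval is scored by inner product (as in the framework described above) and that embeddings from a held-out retriever are already available as plain lists of floats. The function names `rank_of` and `mean_rank_shift` and the toy vectors are hypothetical, chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_of(doc_vec, query_vec, corpus_vecs):
    """1-based rank of doc_vec against corpus_vecs under inner-product scoring."""
    target = dot(doc_vec, query_vec)
    return 1 + sum(1 for c in corpus_vecs if dot(c, query_vec) > target)


def mean_rank_shift(clean_vecs, attacked_vecs, query_vecs, corpus_vecs):
    """Average of (clean rank - attacked rank) over all query/document pairs.

    Positive values mean the injection moved documents up the ranking.
    """
    shifts = [
        rank_of(clean, q, corpus_vecs) - rank_of(attacked, q, corpus_vecs)
        for q in query_vecs
        for clean, attacked in zip(clean_vecs, attacked_vecs)
    ]
    return sum(shifts) / len(shifts)


# Toy data (assumed, not from the paper): one query, three competing documents.
corpus = [[0.9, 0.0], [0.5, 0.0], [0.1, 0.0]]
queries = [[1.0, 0.0]]
clean = [[0.3, 0.0]]      # embedding of the document before injection
attacked = [[0.7, 0.0]]   # embedding of the same document after injection

shift = mean_rank_shift(clean, attacked, queries, corpus)
print(shift)
```

For the test to be query-agnostic in the sense described, the queries used here must be held out: they should play no part in generating the injection tokens.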
Original abstract
Large language models (LLMs) have been serving as effective backbones for retrieval systems, including Retrieval-Augmentation-Generation (RAG), Dense Information Retriever (IR), and Agent Memory Retrieval. Recent studies have demonstrated that such LLM-based Retrieval (LLMR) is vulnerable to adversarial attacks, which manipulates documents by token-level injections and enables adversaries to either boost or diminish these documents in retrieval tasks. However, existing attack studies mainly (1) presume a known query is given to the attacker, and (2) highly rely on access to the victim model's parameters or interactions, which are hardly accessible in real-world scenarios, leading to limited validity. To further explore the secure risks of LLMR, we propose a practical black-box attack method that generates transferable injection tokens based on zero-shot surrogate LLMs without need of victim queries or victim models knowledge. The effectiveness of our attack raises such a robustness issue that similar effects may arise from benign or unintended document edits in the real world. To achieve our attack, we first establish a theoretical framework of LLMR and empirically verify it. Under the framework, we simulate the transferable attack as a min-max problem, and propose an adversarial learning mechanism that finds optimal adversarial tokens with learnable query samples. Our attack is validated to be effective on benchmark datasets across popular LLM retrievers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a query-agnostic black-box attack on LLM-based retrieval (LLMR) systems for RAG, dense IR, and agent memory. It first establishes and empirically verifies a theoretical framework modeling LLMR via inner-product similarity in embeddings. The attack is then formulated as a min-max optimization over surrogate LLM parameters and learnable query samples to discover transferable token injections, without requiring victim queries or model access. Effectiveness is validated on benchmark datasets across popular LLM retrievers, with the claim that similar effects could arise from benign document edits.
Significance. If the transferability results hold, the work is significant for highlighting practical robustness risks in deployed LLM retrieval pipelines. The theoretical framework plus the surrogate-based min-max simulation provides a principled way to study query-agnostic attacks, and the empirical validation on multiple retrievers strengthens the case that black-box threats are realistic. Credit is due for the zero-shot surrogate approach and the explicit framing of the attack as a simulation tool rather than a fitted model.
major comments (2)
- [§3] The LLMR framework reduces retrieval to inner-product similarity, which is then used to justify the min-max attack formulation. However, the transferability premise (that surrogate-optimized tokens and learned queries will align with an unseen victim retriever’s embedding geometry without any query adaptation) is stated but not bounded or tested for distribution shift; this assumption is load-bearing for the central query-agnostic claim.
- [§5, empirical validation] The reported effectiveness on benchmarks lacks explicit controls for embedding-space misalignment between surrogate and victim models, effect-size reporting, or ablation studies isolating the contribution of the learnable query samples versus fixed queries. Without these, it is difficult to confirm that the attack generalizes query-agnostically rather than succeeding only under favorable alignment conditions.
minor comments (2)
- [Abstract] The sentence 'which manipulates documents by token-level injections and enables adversaries to either boost or diminish these documents' has a subject-verb agreement error: 'manipulates' should be 'manipulate', or the clause should be rephrased.
- Notation: The distinction between surrogate parameters and victim parameters in the min-max objective could be clarified with an explicit variable table or consistent subscripting.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing our perspective on the current manuscript and indicating planned revisions where appropriate.
- Referee: [§3] The LLMR framework reduces retrieval to inner-product similarity, which is then used to justify the min-max attack formulation. However, the transferability premise (that surrogate-optimized tokens and learned queries will align with an unseen victim retriever’s embedding geometry without any query adaptation) is stated but not bounded or tested for distribution shift; this assumption is load-bearing for the central query-agnostic claim.
  Authors: Our theoretical framework explicitly models LLMR as inner-product similarity in embedding space, and we empirically verify this modeling choice on the surrogate models used. The min-max formulation with learnable queries is designed to find tokens that remain effective across query variations, which underpins the query-agnostic transfer claim. We demonstrate this empirically by transferring the resulting injection tokens to multiple victim retrievers with distinct embedding models, without any query-specific adaptation. We agree, however, that the manuscript would benefit from a more explicit treatment of distribution shift. In revision we will add a dedicated discussion subsection that derives the expected transfer conditions from the inner-product assumption, and we will include new experiments that quantify embedding misalignment between surrogate and victim models (via average cosine similarity on shared documents).
  Revision: yes
- Referee: [§5, empirical validation] The reported effectiveness on benchmarks lacks explicit controls for embedding-space misalignment between surrogate and victim models, effect-size reporting, or ablation studies isolating the contribution of the learnable query samples versus fixed queries. Without these, it is difficult to confirm that the attack generalizes query-agnostically rather than succeeding only under favorable alignment conditions.
  Authors: We acknowledge that the current empirical section would be strengthened by additional controls and ablations. In the revised manuscript we will: (i) report quantitative measures of embedding-space misalignment between each surrogate and victim model on the evaluation datasets, (ii) include effect-size statistics (mean rank change with standard deviation and confidence intervals) alongside the existing success-rate tables, and (iii) add ablation experiments that compare the full adversarial-learning procedure against variants that use fixed or randomly sampled queries. These changes will more clearly isolate the contribution of the learnable-query component and support the query-agnostic generalization argument.
  Revision: yes
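The two quantitative controls discussed in the rebuttal (embedding misalignment and effect-size reporting) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' procedure: it assumes surrogate and victim embeddings of the same documents live in a space of the same dimension (otherwise a paired cosine is undefined without an alignment step), and it uses a normal-approximation 95% confidence interval. The names `mean_paired_cosine` and `rank_change_summary` are hypothetical.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def mean_paired_cosine(surrogate_vecs, victim_vecs):
    """Average cosine between surrogate and victim embeddings of the same documents.

    Assumption: both models embed into the same dimension. Values near 1
    suggest closely aligned geometry; lower values suggest misalignment.
    """
    sims = [cosine(s, v) for s, v in zip(surrogate_vecs, victim_vecs)]
    return statistics.fmean(sims)


def rank_change_summary(rank_changes, z=1.96):
    """Mean, sample standard deviation, and normal-approximation CI.

    Assumption: z=1.96 gives an approximate 95% interval; with few samples
    a t-based interval would be wider.
    """
    n = len(rank_changes)
    mean = statistics.fmean(rank_changes)
    sd = statistics.stdev(rank_changes)
    half_width = z * sd / math.sqrt(n)
    return mean, sd, (mean - half_width, mean + half_width)


# Toy inputs (assumed, for illustration only).
alignment = mean_paired_cosine([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
mean, sd, ci = rank_change_summary([2, 4, 6])
print(alignment, mean, sd, ci)
```

Reporting the alignment score next to each surrogate-victim pair's rank-change summary would make it visible whether the attack only succeeds when the two embedding spaces happen to agree.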
Circularity Check
No significant circularity; theoretical framework and min-max simulation are independent of fitted outputs.
The paper first establishes a theoretical framework for LLM-based retrieval based on embedding inner-product similarity and empirically verifies it on data. It then formulates the attack as a min-max optimization over surrogate parameters and learnable query samples to generate transferable tokens. This structure does not reduce any claimed prediction or result to its own inputs by construction, nor does it rely on load-bearing self-citations or imported uniqueness theorems. The transferability is presented as an empirical outcome validated on benchmarks rather than a definitional equivalence or fitted renaming. The derivation chain remains self-contained against external benchmarks with no quoted reduction of outputs to inputs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We first establish a theoretical framework of LLMR and empirically verify it. Under the framework, we simulate the transferable attack as a min-max problem..."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Definition 3.1 (ε-pϵ-Precise Retriever)... sim(f(X_i), f(X_j)) ≥ sim(f(X_k), f(X′)) + ϵ"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.