
arxiv: 2602.00364 · v4 · pith:SGUMAJX2new · submitted 2026-01-30 · 💻 cs.CR

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

Pith reviewed 2026-05-21 14:13 UTC · model grok-4.3

classification 💻 cs.CR
keywords black-box attacks · LLM retrieval · adversarial injections · query-agnostic · transferable attacks · RAG security · information retrieval

The pith

Surrogate models let attackers craft query-free injections that shift LLM retrieval rankings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLM-based retrieval can be manipulated by injecting adversarial tokens into documents even when the attacker knows nothing about the user's query and has no access to the target model's parameters or outputs. It does this by first laying out a theoretical account of how these retrieval systems rank documents, then casting the search for effective injections as a min-max optimization that is solved on surrogate models using simulated queries. A reader should care because real-world systems such as RAG pipelines are exposed if the same tokens transfer across models and queries; the work also notes that ordinary document changes might produce comparable effects. The method is tested on standard retrieval benchmarks and several popular LLM retrievers.

Core claim

We establish a theoretical framework for LLM-based retrieval and use it to formulate transferable adversarial injection as a min-max problem. We solve the problem with an adversarial learning procedure that optimizes injection tokens on zero-shot surrogate models while treating queries as learnable variables. The resulting tokens alter document rankings on benchmark datasets across multiple LLM retrievers without any knowledge of the victim query or model.
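
One way to write the min-max problem described above is sketched below. This is an illustrative form, not the paper's own equation: $f_s$ (a frozen surrogate retriever), $d$ (the document to hide), $t$ (the injected tokens), $q_i$ (the learnable query samples) and $\oplus$ (token concatenation) are all notation assumed here. The budget symbol $\delta$ echoes the $\delta$ that the text next to Figure 6 says constrains the injected token amount in Eq. 1, but its exact role in the paper is not visible from this page.

```latex
\min_{t \,:\, |t| \le \delta} \;\; \max_{q_1, \dots, q_n} \;\; \frac{1}{n} \sum_{i=1}^{n} \big\langle f_s(d \oplus t),\; f_s(q_i) \big\rangle
```

Read this way, the inner maximization finds the queries that still match the injected document best, and the outer minimization picks tokens that push the document away from even those queries.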

What carries the argument

A min-max formulation of the transferable attack, solved by adversarial learning on surrogate models with learnable query samples.
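
The Figure 4 caption describes each learning step as first moving the injected document tokens away from the queries and then moving the query tokens towards the document. A toy sketch of that alternation follows. It is not the authors' implementation: a continuous vector `inj` stands in for discrete injected tokens, plain vectors stand in for surrogate LLM embeddings, and the names `minmax_toy`, `budget`, `lr` and `steps` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, c):
    return [c * x for x in a]

def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def mean_similarity(doc_vec, queries):
    return sum(dot(doc_vec, q) for q in queries) / len(queries)

def minmax_toy(doc, queries, budget=0.5, lr=0.1, steps=200):
    """Alternate a min step on the injection and a max step on the queries."""
    inj = [0.0] * len(doc)          # assumption: continuous stand-in for tokens
    queries = [normalize(q) for q in queries]
    for _ in range(steps):
        # Min step: move the injection against the mean query direction,
        # then project it back onto the norm budget.
        inj = add(inj, scale(mean_vec(queries), -lr))
        norm = math.sqrt(dot(inj, inj))
        if norm > budget:
            inj = scale(inj, budget / norm)
        # Max step: pull every query towards the injected document,
        # keeping queries on the unit sphere.
        attacked_doc = add(doc, inj)
        queries = [normalize(add(q, scale(attacked_doc, lr))) for q in queries]
    return inj, queries

doc = [1.0, 0.0]
start_queries = [[0.6, 0.8], [0.8, -0.6], [1.0, 0.0]]
inj, learned = minmax_toy(doc, start_queries)
clean = mean_similarity(doc, learned)
attacked = mean_similarity(add(doc, inj), learned)
print(f"clean={clean:.3f} attacked={attacked:.3f}")
```

The norm budget plays the role of the limit on injected content; without some such constraint the min step would be unbounded in this toy setting.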

If this is right

  • The attack succeeds without any query being supplied to the attacker.
  • The same tokens affect retrieval performance across different LLM retrievers.
  • Ordinary document edits could produce similar unintended ranking shifts.
  • Defenses for retrieval systems must address query-independent threats.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Document pipelines may need automated checks for token patterns that mimic these injections.
  • Robustness benchmarks for retrievers should include query-agnostic test cases.
  • Natural wording variations in real documents could be tested to see if they trigger comparable retrieval biases.

Load-bearing premise

Injection tokens found on surrogate models will transfer to unknown victim retrievers even when the attacker has no query information at all.

What would settle it

Apply the generated tokens to documents and measure whether retrieval rank changes occur on held-out LLM retrievers when the attack procedure is given no query whatsoever.
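
The measurement proposed above can be sketched as a rank comparison on a held-out retriever. The snippet assumes the retriever scores documents by inner product (as the paper's framework is summarized on this page) and uses hand-written vectors in place of real embeddings; `rank_of`, `mean_rank_shift` and the toy corpora are hypothetical names and data.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_of(target_id, query, corpus):
    """1-based rank of target_id when documents are sorted by inner product."""
    order = sorted(corpus, key=lambda doc_id: dot(corpus[doc_id], query),
                   reverse=True)
    return order.index(target_id) + 1

def mean_rank_shift(target_id, queries, clean_corpus, attacked_corpus):
    """Average of (rank after injection - rank before); positive means demoted."""
    shifts = [
        rank_of(target_id, q, attacked_corpus) - rank_of(target_id, q, clean_corpus)
        for q in queries
    ]
    return sum(shifts) / len(shifts)

# Toy embeddings from a hypothetical held-out retriever.
clean_corpus = {"target": [0.9, 0.1], "a": [0.7, 0.3], "b": [0.2, 0.8]}
# Same corpus after the injected tokens changed the target's embedding.
attacked_corpus = dict(clean_corpus, target=[0.4, 0.1])
# Queries the attack procedure never saw.
held_out_queries = [[1.0, 0.0], [0.8, 0.2]]

shift = mean_rank_shift("target", held_out_queries, clean_corpus, attacked_corpus)
print(shift)
```

A shift near zero on retrievers that were not used as surrogates would count against the transfer premise; a consistent positive shift would support it.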

Figures

Figures reproduced from arXiv: 2602.00364 by Chenxiao Yu, Defu Cao, Jiate Li, Li Li, Ryan A. Rossi, Tiannuo Yang, Wei Yang, Xiyang Hu, Yan Liu, Yuehan Qin, Yue Zhao.

Figure 1
Figure 1: In many practical scenarios, attackers may hope to hide web documents from retrieval systems. These websites usually allow normal public users to edit in format of content contribution or discussion replies.
Adjacent text from the paper (truncated): 1. Introduction Retrieval system, which aims to efficiently seek most relevant documents for given user queries, not only occupies great importance in applications like search engines and recommendati…
Figure 2
Figure 2: Illustration of LLM-based Retrieval. Documents in the corpus are firstly embedded in the last-hidden embeddings and stored. When a user query comes and get embedded, it matches relevant documents in embedding similarity in high efficiency.
Adjacent text from the paper (truncated): 1. We first investigate the vulnerability of LLMR in the face of query-agnostic black-box settings, where the attacker has no access to the victim model, the document c…
Figure 3
Figure 3: We sample 40 knowledge contexts on each of four roughly-defined topics and visualize their LLMR embeddings by Principal Component Analysis (PCA) reduction. Embeddings of contexts within the same topic (R:science, G:politic, Y:movie, B:architecture) tend to cluster together in the embedding space.
Adjacent text from the paper (truncated): 2. The LLM retriever model f is a complete black-box, which means the attacker has no knowledge of f including…
Figure 4
Figure 4: The DQ-A learning pipeline of our attack method. Query samples are first generated by a third party casual LLM. Then in every learning step, injected document tokens are first optimized away from queries, and all queries tokens are optimized towards the document. Both surrogate and Casual LLMs require no learning.
Adjacent text from the paper (truncated): pϵ. To verify the rationality of this statement, we conduct an experiment: we utilize a Casua…
Figure 5
Figure 5: Impact of Different |S|
Adjacent text from the paper (truncated): metrics, especially around a 2%-10% drop in Recall@25 and a 1%-7% drop in Recall@50. In dataset Robotics, our attack can achieve even an 8% performance drop for Qwen1.5 and 6% for Jinaai in Recall@50, which reduces these best performing retrievers’ around 30% and 20% drop in fraction of their original performance. Only on Qwen3-Emb-0.6B, both our attack and other baselines find it …
Figure 6
Figure 6: Impact of Injected Token Length
Adjacent text from the paper (truncated): increases from 10 to 15, some of attacks become less effective. From this phenomenon we infer that there is also a limit in |S|’s population effect. When |S| is larger than the inherent optimal ηX ηd ’s require, attacks become less optimal. Impact of Injected Token Amount We also study the impact of the injected token amount (constrained by the δ in Eq.1) on our attack and …
Original abstract

Large language models (LLMs) have been serving as effective backbones for retrieval systems, including Retrieval-Augmentation-Generation (RAG), Dense Information Retriever (IR), and Agent Memory Retrieval. Recent studies have demonstrated that such LLM-based Retrieval (LLMR) is vulnerable to adversarial attacks, which manipulates documents by token-level injections and enables adversaries to either boost or diminish these documents in retrieval tasks. However, existing attack studies mainly (1) presume a known query is given to the attacker, and (2) highly rely on access to the victim model's parameters or interactions, which are hardly accessible in real-world scenarios, leading to limited validity. To further explore the secure risks of LLMR, we propose a practical black-box attack method that generates transferable injection tokens based on zero-shot surrogate LLMs without need of victim queries or victim models knowledge. The effectiveness of our attack raises such a robustness issue that similar effects may arise from benign or unintended document edits in the real world. To achieve our attack, we first establish a theoretical framework of LLMR and empirically verify it. Under the framework, we simulate the transferable attack as a min-max problem, and propose an adversarial learning mechanism that finds optimal adversarial tokens with learnable query samples. Our attack is validated to be effective on benchmark datasets across popular LLM retrievers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a query-agnostic black-box attack on LLM-based retrieval (LLMR) systems for RAG, dense IR, and agent memory. It first establishes and empirically verifies a theoretical framework modeling LLMR via inner-product similarity in embeddings. The attack is then formulated as a min-max optimization over injection tokens and learnable query samples, evaluated on frozen zero-shot surrogate LLMs, to discover transferable token injections without requiring victim queries or model access. Effectiveness is validated on benchmark datasets across popular LLM retrievers, with the claim that similar effects could arise from benign document edits.

Significance. If the transferability results hold, the work is significant for highlighting practical robustness risks in deployed LLM retrieval pipelines. The theoretical framework plus the surrogate-based min-max simulation provides a principled way to study query-agnostic attacks, and the empirical validation on multiple retrievers strengthens the case that black-box threats are realistic. Credit is due for the zero-shot surrogate approach and the explicit framing of the attack as a simulation tool rather than a fitted model.

major comments (2)
  1. [§3] §3: The LLMR framework reduces retrieval to inner-product similarity, which is then used to justify the min-max attack formulation. However, the transferability premise—that surrogate-optimized tokens and learned queries will align with an unseen victim retriever’s embedding geometry without any query adaptation—is stated but not bounded or tested for distribution shift; this assumption is load-bearing for the central query-agnostic claim.
  2. [§5] §5 (empirical validation): The reported effectiveness on benchmarks lacks explicit controls for embedding-space misalignment between surrogate and victim models, effect-size reporting, or ablation studies isolating the contribution of the learnable query samples versus fixed queries. Without these, it is difficult to confirm that the attack generalizes query-agnostically rather than succeeding only under favorable alignment conditions.
minor comments (2)
  1. [Abstract] Abstract: The sentence 'which manipulates documents by token-level injections and enables adversaries to either boost or diminish these documents' contains a subject-verb agreement issue ('manipulates' should be 'manipulate' or rephrased).
  2. Notation: The distinction between surrogate parameters and victim parameters in the min-max objective could be clarified with an explicit variable table or consistent subscripting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing our perspective on the current manuscript and indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [§3] §3: The LLMR framework reduces retrieval to inner-product similarity, which is then used to justify the min-max attack formulation. However, the transferability premise—that surrogate-optimized tokens and learned queries will align with an unseen victim retriever’s embedding geometry without any query adaptation—is stated but not bounded or tested for distribution shift; this assumption is load-bearing for the central query-agnostic claim.

    Authors: Our theoretical framework explicitly models LLMR retrieval as inner-product similarity in embedding space and we empirically verify this modeling choice on the surrogate models used. The min-max formulation with learnable queries is designed to optimize for tokens that remain effective across query variations, which underpins the query-agnostic transfer claim. We demonstrate this empirically by transferring the resulting injection tokens to multiple victim retrievers with distinct embedding models, without any query-specific adaptation. We agree, however, that the manuscript would benefit from a more explicit treatment of distribution shift. In revision we will add a dedicated discussion subsection that derives the expected transfer conditions from the inner-product assumption and include new experiments that quantify embedding misalignment (via average cosine similarity on shared documents) between surrogate and victim models. revision: yes

  2. Referee: [§5] §5 (empirical validation): The reported effectiveness on benchmarks lacks explicit controls for embedding-space misalignment between surrogate and victim models, effect-size reporting, or ablation studies isolating the contribution of the learnable query samples versus fixed queries. Without these, it is difficult to confirm that the attack generalizes query-agnostically rather than succeeding only under favorable alignment conditions.

    Authors: We acknowledge that the current empirical section would be strengthened by additional controls and ablations. In the revised manuscript we will: (i) report quantitative measures of embedding-space misalignment between each surrogate and victim model on the evaluation datasets, (ii) include effect-size statistics (mean rank change with standard deviation and confidence intervals) alongside the existing success-rate tables, and (iii) add ablation experiments that compare the full adversarial-learning procedure against variants that use fixed or randomly sampled queries. These changes will more clearly isolate the contribution of the learnable-query component and support the query-agnostic generalization argument. revision: yes
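
Two of the measurements promised in the responses above can be sketched in a few lines: the average cosine similarity between surrogate and victim embeddings on shared documents, and mean rank change with standard deviation and a confidence interval. Both parts rest on stated assumptions. The cosine comparison only makes sense if the two models embed into the same dimension and coordinate system (or have been aligned first), and the interval uses a normal approximation with z = 1.96. All names and numbers are hypothetical.

```python
import math
import statistics

def cosine(a, b):
    inner = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return inner / (norm_a * norm_b)

def mean_alignment(surrogate_emb, victim_emb):
    """Average cosine similarity over documents embedded by both models.

    Assumption: both models share one embedding space; otherwise an
    alignment step would be needed before this comparison is meaningful.
    """
    shared = sorted(set(surrogate_emb) & set(victim_emb))
    return sum(cosine(surrogate_emb[d], victim_emb[d]) for d in shared) / len(shared)

def rank_change_summary(shifts, z=1.96):
    """Mean, sample standard deviation, and normal-approximation 95% interval."""
    mean = statistics.mean(shifts)
    sd = statistics.stdev(shifts)
    half_width = z * sd / math.sqrt(len(shifts))
    return mean, sd, (mean - half_width, mean + half_width)

# Toy embeddings of two shared documents from each model.
surrogate_emb = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
victim_emb = {"d1": [1.0, 0.0], "d2": [1.0, 1.0]}
alignment = mean_alignment(surrogate_emb, victim_emb)

# Toy per-query rank changes for one attacked document.
shifts = [1.0, 2.0, 3.0, 2.0, 2.0]
mean_shift, sd_shift, ci = rank_change_summary(shifts)
print(f"alignment={alignment:.3f} mean={mean_shift:.2f} sd={sd_shift:.2f}")
```

Reporting the alignment score next to the rank-change interval for each surrogate and victim pair would show whether the attack weakens as the two embedding spaces diverge.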

Circularity Check

0 steps flagged

No significant circularity; theoretical framework and min-max simulation are independent of fitted outputs.

Full rationale

The paper first establishes a theoretical framework for LLM-based retrieval based on embedding inner-product similarity and empirically verifies it on data. It then formulates the attack as a min-max optimization over injection tokens and learnable query samples, run on zero-shot surrogate models, to generate transferable tokens. This structure does not reduce any claimed prediction or result to its own inputs by construction, nor does it rely on load-bearing self-citations or imported uniqueness theorems. The transferability is presented as an empirical outcome validated on benchmarks rather than a definitional equivalence or fitted renaming. The derivation chain remains self-contained against external benchmarks with no quoted reduction of outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the theoretical framework and min-max simulation are referenced but not expanded.

pith-pipeline@v0.9.0 · 5801 in / 992 out tokens · 43580 ms · 2026-05-21T14:13:19.376072+00:00 · methodology


