pith. sign in

arxiv: 2605.30365 · v1 · pith:DM6PJLIYnew · submitted 2026-05-18 · 💻 cs.SD · cs.AI· eess.AS

Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation

Pith reviewed 2026-06-30 18:55 UTC · model grok-4.3

classification 💻 cs.SD cs.AIeess.AS
keywords caption poisoningretrieval-augmented generationtext-to-music generationadversarial attackmusic caption datasetprompt augmentationCLAP retrieverMusicGen
0
0 comments X

The pith

An attacker can inject a small number of crafted music captions to poison the retrieval database and steer text-to-music generation toward a chosen target without altering the user prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented text-to-music systems expand underspecified user prompts by pulling captions from a music knowledge database before feeding them to the generator. The paper shows that an attacker can compromise this process by adding only a few specially designed captions to the database. These captions use a dual-layer design that keeps them retrievable for relevant queries while embedding acoustic details that bias the augmented prompt. Experiments on the MusicCaps dataset with a CLAP retriever and MusicGen generator demonstrate that outputs move measurably closer to the attacker's intended acoustic character while staying aligned with the original query. The results identify a database integrity vulnerability that affects creative retrieval-augmented systems.

Core claim

We show that an attacker can poison the database by injecting a small number of crafted music captions, causing the system to retrieve malicious captions that bias prompt augmentation and steer generation away from the user's intended function, without modifying the user prompt, retriever, or generator. To achieve the music caption poisoning attack, we propose a dual-layer caption poisoning strategy that preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker-chosen target intent.

What carries the argument

Dual-layer caption poisoning strategy that preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and generation

If this is right

  • Poisoned generations move substantially closer to the attacker's target.
  • Generations remain comparably aligned with the original user query.
  • The attack succeeds by injecting only a small number of crafted captions.
  • No modification to the user prompt, retriever, or generator is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same database-poisoning approach could apply to retrieval-augmented systems in other modalities such as text-to-image generation.
  • Database verification or provenance tracking would be needed to block this class of integrity attack.
  • The attack's success depends on the retriever's tolerance for low-level descriptor conflicts within high-level anchor matches.

Load-bearing premise

The dual-layer caption poisoning strategy preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker-chosen target intent.

What would settle it

If the CLAP retriever on the poisoned MusicCaps database fails to return the crafted captions for relevant user prompts or if MusicGen outputs show no measurable shift toward the target acoustic descriptors while alignment with the original query is preserved, the attack claim is falsified.

Figures

Figures reproduced from arXiv: 2605.30365 by Hanqing Guo, Long Cheng, Nan Zhang, Shuhao Zhang, Yizhu Wen.

Figure 1
Figure 1. Figure 1: RAG-based caption augmentation for text-to-music generation: [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the poisoned caption generation pipeline. Given a target query [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A concrete example of a caption poisoning attack [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prompt template for generating acoustically similar but functionally inverted caption pairs. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Retrieval-augmented text-to-music (TTM) systems augment underspecified user prompts using captions retrieved from a music caption dataset. This design introduces an integrity dependency on the music knowledge database. We show that an attacker can poison the database by injecting a small number of crafted music captions, causing the system to retrieve malicious captions that bias prompt augmentation and steer generation away from the user's intended function, without modifying the user prompt, retriever, or generator. To achieve the music caption poisoning attack, we propose a dual-layer caption poisoning strategy that preserves high-level retrieval anchors while injecting low-level acoustic descriptors to steer prompt augmentation and downstream music generation toward an attacker-chosen target intent. In a MusicCaps knowledge database, CLAP retriever, and MusicGen pipeline, poisoned generations move substantially closer to the attacker's target, while remaining comparably aligned with the original user query. These results expose a practical integrity risk for retrieval-augmented creative AI systems. Our demo can be found at: https://yizhu-wen.github.io/Mental-Damage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that retrieval-augmented text-to-music systems can be attacked via caption poisoning: an attacker injects a small number of dual-layer crafted captions into the database (e.g., MusicCaps) that preserve high-level anchors for CLAP retrieval while injecting low-level acoustic descriptors to bias MusicGen generations toward an attacker-chosen target intent, all without altering the user prompt, retriever, or generator. Experiments reportedly show poisoned outputs move substantially closer to the target while remaining comparably aligned with the original query.

Significance. If the empirical results hold under proper controls, the work identifies a practical integrity vulnerability in RAG-based creative AI systems, which is timely given increasing deployment of retrieval-augmented generators. The public demo strengthens the contribution by enabling direct inspection of the attack.

major comments (2)
  1. [Method section on dual-layer caption poisoning strategy] The dual-layer caption poisoning strategy (described in the method) assumes that low-level acoustic descriptor injection preserves high-level retrieval anchors in CLAP embeddings. No section reports quantitative evidence such as cosine similarity shifts, embedding distance distributions, or pre/post-poisoning retrieval rank statistics for the poisoned captions under original user queries. This is load-bearing for the central claim, as failure to remain in top-k would invalidate the attack.
  2. [Experimental results section] The results claim that 'poisoned generations move substantially closer to the attacker's target' (abstract and experimental section) but provides no specific metrics, baselines, controls, or statistical tests (e.g., no CLAP or other embedding distances to target, no comparison tables, no effect sizes). This prevents assessment of whether the data supports the 'substantially closer' assertion.
minor comments (2)
  1. Ensure all experimental details (dataset splits, exact poisoning ratios, retrieval k, generation parameters) are fully specified to support reproducibility.
  2. The demo link is useful; confirm it remains accessible and includes code for the poisoning construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas where additional quantitative support will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested evidence and metrics.

read point-by-point responses
  1. Referee: [Method section on dual-layer caption poisoning strategy] The dual-layer caption poisoning strategy (described in the method) assumes that low-level acoustic descriptor injection preserves high-level retrieval anchors in CLAP embeddings. No section reports quantitative evidence such as cosine similarity shifts, embedding distance distributions, or pre/post-poisoning retrieval rank statistics for the poisoned captions under original user queries. This is load-bearing for the central claim, as failure to remain in top-k would invalidate the attack.

    Authors: We agree that quantitative validation of the high-level anchor preservation is necessary for the central claim. In the revised manuscript we will add a new analysis subsection (or appendix) reporting cosine similarity shifts between original and poisoned caption embeddings in CLAP space, embedding distance distributions, and pre/post-poisoning retrieval rank statistics under the original user queries. These results will confirm that the poisoned captions remain within the top-k retrieved set. revision: yes

  2. Referee: [Experimental results section] The results claim that 'poisoned generations move substantially closer to the attacker's target' (abstract and experimental section) but provides no specific metrics, baselines, controls, or statistical tests (e.g., no CLAP or other embedding distances to target, no comparison tables, no effect sizes). This prevents assessment of whether the data supports the 'substantially closer' assertion.

    Authors: We acknowledge that the current experimental section lacks the specific quantitative details needed to evaluate the strength of the 'substantially closer' claim. The revised version will expand the results with CLAP (and other) embedding distances to the target, explicit baseline comparisons, summary tables, effect sizes, and appropriate statistical tests (e.g., paired significance tests) while clarifying the experimental controls. This will allow readers to directly assess the reported improvements. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical attack demonstration with no derivation chain

full rationale

The paper presents an empirical demonstration of a caption poisoning attack on retrieval-augmented text-to-music systems. It proposes a dual-layer strategy and evaluates it on MusicCaps + CLAP + MusicGen, showing poisoned generations shift toward attacker targets while remaining aligned with user queries. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claim rests on experimental outcomes rather than any reduction of outputs to inputs by construction. This is the expected non-finding for an attack paper without theoretical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5728 in / 1178 out tokens · 30481 ms · 2026-06-30T18:55:14.360783+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    MusicLM: Generating Music From Text

    A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchiet al., “Musiclm: Generating music from text,”arXiv preprint arXiv:2301.11325, 2023

  2. [2]

    Simple and controllable music generation,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. D ´efossez, “Simple and controllable music generation,”Ad- vances in neural information processing systems, vol. 36, pp. 47 704– 47 720, 2023

  3. [3]

    Audioldm: Text-to-audio generation with latent diffusion models,

    H. Liu, Z. Chen, Y . Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,”arXiv preprint arXiv:2301.12503, 2023

  4. [4]

    Music discovery dialogue generation using human intent analysis and large language model,

    S. Doh, K. Choi, D. Kwon, T. Kim, and J. Nam, “Music discovery dialogue generation using human intent analysis and large language model,” inInternational Society for Music Information Retrieval Con- ference 2024. International Society for Music Information Retrieval, 2024

  5. [5]

    A survey of music similarity and recommen- dation from music context data,

    P. Knees and M. Schedl, “A survey of music similarity and recommen- dation from music context data,”ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 10, no. 1, pp. 1–21, 2013

  6. [6]

    A retrieval augmented approach for text-to-music generation,

    R. Gonzales and F. Rudzicz, “A retrieval augmented approach for text-to-music generation,” inProceedings of the 3rd Workshop on NLP for Music and Audio (NLP4MusA), 2024, pp. 31–36

  7. [7]

    Enhancing test-to-music generation through retrieval-augmented prompt rewrite,

    M. Ding, B. McFee, C. Hu, S. Yang, and J. Huang, “Enhancing test-to-music generation through retrieval-augmented prompt rewrite,” in1st Workshop on Large Language Models for Music & Audio (LLM4MA), 2025

  8. [8]

    {PoisonedRAG}: Knowl- edge corruption attacks to{Retrieval-Augmented}generation of large language models,

    W. Zou, R. Geng, B. Wang, and J. Jia, “{PoisonedRAG}: Knowl- edge corruption attacks to{Retrieval-Augmented}generation of large language models,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 3827–3844

  9. [9]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 9459–9474

  10. [10]

    Beyond the status quo: A contemporary survey of advances and challenges in audio captioning,

    X. Xu, Z. Xie, M. Wu, and K. Yu, “Beyond the status quo: A contemporary survey of advances and challenges in audio captioning,” IEEE/ACM Transactions on Audio, Speech, and Language Process- ing, vol. 32, pp. 95–112, 2023

  11. [11]

    Clap: Learn- ing audio concepts from natural language supervision,

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap: Learn- ing audio concepts from natural language supervision,” inICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  12. [12]

    google/musiccaps dataset card,

    Hugging Face, “google/musiccaps dataset card,” https://huggingface. co/datasets/google/MusicCaps, 2023, dataset card for MusicCaps

  13. [13]

    Retrieval-augmented diffusion models,

    A. Blattmann, R. Rombach, K. Oktay, J. M ¨uller, and B. Ommer, “Retrieval-augmented diffusion models,”Advances in Neural Infor- mation Processing Systems, vol. 35, pp. 15 309–15 324, 2022

  14. [14]

    kNN-diffusion: Image generation via large-scale retrieval,

    S. Sheynin, O. Ashual, A. Polyak, U. Singer, O. Gafni, E. Nachmani, and Y . Taigman, “kNN-diffusion: Image generation via large-scale retrieval,” inThe Eleventh International Conference on Learning Representations, 2023

  15. [15]

    Retrieval-augmented mul- timodal language modeling,

    M. Yasunaga, A. Aghajanyan, W. Shi, R. James, J. Leskovec, P. Liang, M. Lewis, L. Zettlemoyer, and W.-t. Yih, “Retrieval-augmented mul- timodal language modeling,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023

  16. [16]

    Recap: Retrieval-augmented audio captioning,

    S. Ghosh, S. Kumar, C. K. R. Evuru, R. Duraiswami, and D. Manocha, “Recap: Retrieval-augmented audio captioning,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1161– 1165

  17. [17]

    Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning,

    C. Choi, S. Lim, and W. Rhee, “Enhancing Retrieval-Augmented Audio Captioning with Generation-Assisted Multimodal Querying and Progressive Learning,” inInterspeech 2025, 2025, pp. 2655–2659

  18. [18]

    Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation,

    M. Yang, B. Shi, M. Le, W.-N. Hsu, and A. Tjandra, “Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation,” inInterspeech 2025, 2025, pp. 1243–1247

  19. [19]

    Feedback-driven retrieval-augmented audio generation with large audio language models,

    J. Zhao, C. Li, J. Zhao, R. Chen, D. Yu, M. D. Plumbley, and W. Wang, “Feedback-driven retrieval-augmented audio generation with large audio language models,” inICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026

  20. [20]

    Research on architecture for long-tailed genre computer intelligent classification with music information retrieval and deep learning,

    G. Sun, “Research on architecture for long-tailed genre computer intelligent classification with music information retrieval and deep learning,” inJournal of Physics: Conference Series, vol. 2033, no. 1. IOP Publishing, 2021, p. 012008

  21. [21]

    Realm: retrieval-augmented language model pre-training,

    K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang, “Realm: retrieval-augmented language model pre-training,” inProceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020

  22. [22]

    Multimodal retrieval-augmented generation framework for visually rich knowledge in the architecture domain,

    X. Meng and Z. Tong, “Multimodal retrieval-augmented generation framework for visually rich knowledge in the architecture domain,” Architectural Intelligence, vol. 4, no. 1, p. 22, 2025

  23. [23]

    Poisonedeye: Knowledge poisoning attack on retrieval-augmented generation based large vision-language models,

    C. Zhang, X. Zhang, J. Lou, K. Wu, Z. Wang, and X. Chen, “Poisonedeye: Knowledge poisoning attack on retrieval-augmented generation based large vision-language models,” inForty-second In- ternational Conference on Machine Learning, 2025

  24. [24]

    Poisoned-mrag: Knowledge poisoning attacks to multimodal retrieval augmented generation,

    Y . Liu, Z. Yuan, G. Tie, J. Shi, P. Zhou, L. Sun, and N. Z. Gong, “Poisoned-mrag: Knowledge poisoning attacks to multimodal retrieval augmented generation,”arXiv preprint arXiv:2503.06254, 2025

  25. [25]

    Claude sonnet 4.6,

    Anthropic, “Claude sonnet 4.6,” https://www.anthropic.com, 2024, large language model

  26. [26]

    Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,

    R. Huang, J. Huang, D. Yang, Y . Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 13 916–13 932. Appendix Attack example Benign user intent:Sleep and rest audio User query:I want something very slow and si...