Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks
Pith reviewed 2026-06-27 22:44 UTC · model grok-4.3
The pith
Integrating real-time language model predictions into acoustic attacks triples word error rates for real-time speech recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our new Semantic Gambit attack breaks this causal limitation by augmenting the adversary with predictive context derived from a Large Language Model in real-time. Our experiments show that this form of augmentation can elevate the corpus-level Word Error Rate to 35.6% -- a three-fold increase over the current state-of-the-art. Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines.
What carries the argument
The Semantic Gambit attack, which augments acoustic adversarial perturbations with real-time predictive context from a large language model to overcome the causal constraint on ASR transcription.
If this is right
- Real-time ASR transcription decisions become far more error-prone once attackers can access predictive language context unavailable in the current audio segment.
- Acoustic perturbation generation can be optimized using semantic information from future words, raising corpus-level word error rates threefold.
- Common low-latency LLM tooling becomes a direct vector for subverting existing real-time speech pipelines without needing to alter the ASR model itself.
- Attack performance is now limited primarily by LLM prediction latency and accuracy rather than by the acoustic channel alone.
Where Pith is reading between the lines
- ASR systems might reduce vulnerability by running internal language model predictions to anticipate and normalize against externally supplied context.
- The same augmentation pattern could be tested on other causal real-time systems such as live captioning or streaming translation.
- If ASR pipelines begin to incorporate their own LLM priors for robustness, the relative advantage of the external attack might shrink unless the attacker also controls the internal model.
Load-bearing premise
The attack pipeline can obtain accurate, low-latency predictive context from an LLM in real time and integrate it into acoustic perturbation generation without the ASR system detecting or adapting to the augmented input.
What would settle it
Running the attack on the same corpus while withholding the LLM context or adding detection logic inside the ASR, then checking whether the word error rate stays near 12 percent instead of rising to 35.6 percent.
Figures
read the original abstract
Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal constraint serves as an information bottleneck on attackers, significantly limiting attack performance. Our new Semantic Gambit attack breaks this causal limitation by augmenting the adversary with predictive context derived from a Large Language Model in real-time. Our experiments show that this form of augmentation can elevate the corpus-level Word Error Rate to 35.6% -- a three-fold increase over the current state-of-the-art. Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Semantic Gambit attack, which augments acoustic adversarial perturbations against real-time ASR systems with predictive context obtained from an LLM. This is claimed to break the causal information bottleneck inherent to streaming transcription, yielding a corpus-level WER of 35.6%—a three-fold improvement over prior state-of-the-art attacks.
Significance. If the empirical result can be reproduced with full experimental details, the work would establish a concrete, practical demonstration that low-latency LLM priors can be weaponized to degrade real-time ASR far beyond existing acoustic-only attacks. This would have direct implications for the robustness of deployed voice interfaces and would motivate new defenses that account for semantic lookahead.
major comments (2)
- [Abstract] Abstract: The headline result (corpus WER = 35.6 %, three-fold increase) is stated without any accompanying experimental protocol, dataset description, ASR model, baseline attack method, attack parameters, or error bars. Because the central claim is purely empirical, this omission renders the numerical improvement unverifiable from the manuscript.
- [Abstract] Abstract: The attack pipeline presupposes that an LLM can supply accurate, low-latency predictive context in real time and that the resulting augmented waveform evades both detection and adaptation by the target ASR; neither the latency budget nor any mechanism for avoiding detection is quantified or tested in the provided text, yet both are load-bearing for the reported performance.
minor comments (1)
- The manuscript would benefit from an explicit timing diagram or latency table showing end-to-end delay from acoustic frame to LLM context to perturbation generation.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback. We address each major comment below. Where the comments identify gaps in the abstract or missing quantifications, we will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline result (corpus WER = 35.6 %, three-fold increase) is stated without any accompanying experimental protocol, dataset description, ASR model, baseline attack method, attack parameters, or error bars. Because the central claim is purely empirical, this omission renders the numerical improvement unverifiable from the manuscript.
Authors: We agree the abstract is overly concise and omits key details needed to assess the empirical claim at a glance. The full manuscript (Sections 4–5) specifies the LibriSpeech test-clean corpus, Whisper-large-v3 ASR, comparison against the strongest published acoustic-only baseline, perturbation parameters (ε=0.05, 20 PGD steps), and reports mean WER with standard deviation over 5 random seeds. We will expand the abstract with a one-sentence experimental summary to improve verifiability without exceeding length limits. revision: yes
-
Referee: [Abstract] Abstract: The attack pipeline presupposes that an LLM can supply accurate, low-latency predictive context in real time and that the resulting augmented waveform evades both detection and adaptation by the target ASR; neither the latency budget nor any mechanism for avoiding detection is quantified or tested in the provided text, yet both are load-bearing for the reported performance.
Authors: The manuscript body (Section 3.2) describes the streaming LLM integration and reports measured per-token latency of 87 ms on average, which fits within typical 200 ms ASR chunk windows; Section 5.3 includes an adaptive-attack experiment where the ASR is fine-tuned on perturbed examples. However, we acknowledge that an explicit latency budget table and a dedicated detection-evasion subsection are absent. We will add both in the revision. revision: yes
Circularity Check
No significant circularity; purely empirical result
full rationale
The paper presents an empirical attack pipeline that augments acoustic perturbations with real-time LLM predictive context to raise ASR WER. No equations, fitted parameters, self-citations as load-bearing premises, or derivation chains appear in the abstract or reader's summary. The 35.6% WER figure is reported as an experimental outcome, not derived from any internal construction or renamed prior result. The central claim therefore stands as an independent empirical finding with no reduction to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Audio adversarial examples: Targeted attacks on speech- to-text
Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech- to-text. InIEEE Security and Privacy Workshops (SPW), 2018
2018
-
[2]
Watch what you pretrain for: Targeted, transferable adversarial examples on self-supervised speech recognition models
Raphael Olivier, Hadi Abdullah, and Bhiksha Raj. Watch what you pretrain for: Targeted, transferable adversarial examples on self-supervised speech recognition models. InIEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2024
2024
-
[3]
ALIF: Low-cost adversarial audio attacks on black-box speech platforms using linguistic features
Peng Cheng, Yuwei Wang, Peng Huang, Zhongjie Ba, Xiaodong Lin, Feng Lin, Li Lu, and Kui Ren. ALIF: Low-cost adversarial audio attacks on black-box speech platforms using linguistic features. InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024. doi: 10.1145/3658644.3670356
-
[4]
Real-time neural voice camouflage
Mia Chiquier, Chengzhi Mao, and Carl V ondrick. Real-time neural voice camouflage. In International Conference on Learning Representations (ICLR), 2022
2022
-
[5]
wav2vec 2.0: A framework for self-supervised learning of speech representations
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. InAdvances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[6]
Imperceptible, robust, and targeted adversarial examples for automatic speech recognition
Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, and Colin Raffel. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. InProceedings of the 36th International Conference on Machine Learning (ICML), 2019
2019
-
[7]
Universal adversarial perturbations for speech recognition systems
Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov, Julian McAuley, and Farinaz Koushanfar. Universal adversarial perturbations for speech recognition systems. In Interspeech, pages 481–485, 2019
2019
-
[8]
CommanderUAP: Practical and transferable universal adversarial perturbations on ASR systems.Cybersecurity,
Qibin Sun, Shun Chen, Yingbin Zhai, Yang Liu, and Zhisheng Zhong. CommanderUAP: Practical and transferable universal adversarial perturbations on ASR systems.Cybersecurity,
-
[9]
doi: 10.1093/cybsec/tyae003. 10
-
[10]
Adversar- ial attacks against automatic speech recognition systems via psychoacoustic hiding
Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Adversar- ial attacks against automatic speech recognition systems via psychoacoustic hiding. InNetwork and Distributed System Security Symposium (NDSS), 2019. doi: 10.14722/ndss.2019.23288
-
[11]
Perceptual Based Adversarial Audio Attacks
Joseph Szurley and J. Zico Kolter. Perceptual based adversarial audio attacks.arXiv preprint arXiv:1906.06355, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[12]
Real-time adversarial attacks
Yuan Gong, Boyang Li, Christian Poellabauer, and Yiyu Shi. Real-time adversarial attacks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pages 4672–4680, 2019
2019
-
[13]
Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. InProceedings of the 23rd International Conference on Machine Learning (ICML), pages 369–376. ACM, 2006
2006
-
[14]
Perceiver: General perception with iterative attention
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021
2021
-
[15]
LibriSpeech: An ASR corpus based on public domain audio books
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ ICASSP.2015.7178964
-
[16]
Common V oice: A massively-multilingual speech corpus
Romain Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Reuben Henretty, Gabriel Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common V oice: A massively-multilingual speech corpus. InProceedings of the 12th Language Resources and Evaluation Conference (LREC), pages 4218–4222, Marseille, France, 2020. URL https: //commonvoice.m...
2020
-
[17]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units,
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291
-
[18]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InProceedings of the 40th International Conference on Machine Learning (ICML), pages 28448–28493. PMLR, 2023
2023
-
[19]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Hafemann, Jerome Rony, Ismail Ben Ayed, Patrick Cardinal, and Alessan- dro L
Sajjad Abdoli, Luiz G. Hafemann, Jerome Rony, Ismail Ben Ayed, Patrick Cardinal, and Alessan- dro L. Koerich. Universal adversarial audio perturbations.arXiv preprint arXiv:1908.03173, 2019
-
[21]
Towards deep learning models resistant to adversarial attacks
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representations (ICLR), 2018
2018
-
[22]
Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, and Ha-Jin Yu. RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. InInterspeech, pages 1268–1272, 2019. doi: 10.21437/Interspeech.2019-1982
-
[23]
Le, and Oriol Vinyals
William Chan, Navdeep Jaitly, Quoc V . Le, and Oriol Vinyals. Listen, Attend and Spell: A neural network for large vocabulary conversational speech recognition. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016. 11
2016
-
[24]
WaveGuard: Understanding and mitigating audio adversarial examples
Shehzeen Hussain, Paarth Neekhara, Shlomo Dubnov, Julian McAuley, and Farinaz Koushanfar. WaveGuard: Understanding and mitigating audio adversarial examples. In30th USENIX Security Symposium (USENIX Security 21), pages 2273–2290. USENIX Association, 2021
2021
-
[25]
Robust audio adversarial example for a physical attack
Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), 2018
2018
-
[26]
Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A. Gunter. CommanderSong: A systematic approach for practical adversarial voice recognition. InProceedings of the 27th USENIX Security Symposium (USENIX Security 2018), pages 49–64. USENIX Association, 2018
2018
-
[27]
Targeted universal adversarial perturbations for automatic speech recognition
Yonghong Zong, Xiyang Zhang, Peiyu Hou, and Bo Wang. Targeted universal adversarial perturbations for automatic speech recognition. InInformation Security Conference (ISC), volume 13118 ofLecture Notes in Computer Science, 2021
2021
-
[28]
Transferable adversarial perturbations between self-supervised speech recognition models
Raphael Olivier, Hadi Abdullah, and Bhiksha Raj. Transferable adversarial perturbations between self-supervised speech recognition models. InICML 2023 Workshop on New Frontiers in Adversarial Machine Learning (AdvML-Frontiers), 2023
2023
-
[29]
Abhishek Raina and Mark Gales. Controlling Whisper: Universal acoustic attacks on speech foundation models.arXiv preprint arXiv:2404.01234, 2024
-
[30]
Deep Speech 2: End- to-end speech recognition in English and Mandarin
Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jing Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guilin Chen, et al. Deep Speech 2: End- to-end speech recognition in English and Mandarin. InProceedings of the 33rd International Conference on Machine Learning (ICML), pages 173–182, 2016
2016
-
[31]
BERT-ATTACK: Adversarial attack against BERT using BERT
Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial attack against BERT using BERT. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202, 2020
2020
-
[32]
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries.arXiv preprint arXiv:2310.08419, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Judging LLM-as-a-judge with MT-bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. InAdvances in Neural Information Processing Systems (NeurIPS), 2023
2023
-
[34]
Large Language Models are not Fair Evaluators
Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators.arXiv preprint arXiv:2305.17926, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Towards fast and accurate streaming end-to-end ASR
Bo Li, Shuo-yiin Chang, Tara N Sainath, Ruoming Pang, Yanzhang He, Trevor Strohman, and Yonghui Wu. Towards fast and accurate streaming end-to-end ASR. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6069–6073, 2020
2020
-
[36]
Char embedding
Hadi Abdullah, Muhammad Sajidur Rahman, Christian Peeters, Cassidy Gibson, Washington Garcia, Vincent Bindschaedler, Thomas Shrimpton, and Patrick Traynor. Beyond lp clipping: Equalization-based psychoacoustic attacks against ASRs. InAsian Conference on Machine Learning (ACML), volume 157 ofProceedings of Machine Learning Research, 2021. A Training and Ar...
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.