Benchmarking Gaslighting Attacks Against Speech Large Language Models

Bin Zhu; Jinyang Wu; Pan Zhou; Qiquan Zhang; Xiandong Zou; Xu Fang

arxiv: 2509.19858 · v2 · pith:LHEDBXLSnew · submitted 2025-09-24 · 💻 cs.CL

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou

Pith reviewed 2026-05-25 08:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords gaslighting attacks · speech large language models · adversarial robustness · model vulnerability · manipulation strategies · voice-based AI · accuracy degradation

The pith

Speech large language models suffer an average 24.3 percent accuracy drop when exposed to five gaslighting attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that speech large language models are vulnerable to deliberately manipulative prompts that aim to mislead or distort their reasoning. It tests this through five specific strategies applied to five models across more than ten thousand samples from five datasets. A sympathetic reader would care because these models are entering voice-based applications where such inputs could produce unreliable or altered behavior. The evaluation tracks both accuracy loss and secondary responses such as unsolicited apologies and refusals. Acoustic perturbations are also examined to probe multi-modal effects.

Core claim

Speech LLMs exhibit significant behavioral vulnerability to gaslighting attacks constructed from five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation. Comprehensive testing on five models and over 10,000 samples from five diverse datasets produces an average accuracy drop of 24.3 percent, while also recording changes in behavioral outputs including apologies and refusals; separate acoustic perturbation tests assess multi-modal robustness.
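The headline figure is an average accuracy drop, so it helps to see how such a number could be computed. The sketch below is an illustration only, not the authors' evaluation code: the function names, the `(model, strategy)` keying, and the toy correctness flags are all assumptions, and the summary does not say whether 24.3 percent is an absolute drop in percentage points or a relative drop, so both are shown.

```python
from statistics import mean


def accuracy(correct_flags):
    """Fraction of samples answered correctly (flags are booleans)."""
    return sum(correct_flags) / len(correct_flags)


def accuracy_drop(baseline_flags, attacked_flags):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    base = accuracy(baseline_flags)
    attacked = accuracy(attacked_flags)
    absolute = (base - attacked) * 100.0
    relative = (base - attacked) / base * 100.0 if base > 0 else 0.0
    return absolute, relative


def average_drop(results):
    """Average the absolute drop over every (model, strategy) cell.

    `results` maps (model, strategy) -> (baseline_flags, attacked_flags).
    This keying is a hypothetical layout, not the paper's data format.
    """
    return mean(accuracy_drop(b, a)[0] for b, a in results.values())


# Toy, made-up correctness flags purely for illustration.
toy = {
    ("model_a", "Anger"): ([True, True, True, False], [True, False, False, False]),
    ("model_a", "Sarcasm"): ([True, True, True, False], [True, True, False, False]),
}

print(average_drop(toy))  # average absolute drop in percentage points
```

Averaging per-cell drops weights every model and strategy equally; pooling all samples before computing accuracy would instead weight larger datasets more heavily, and the two choices can give different headline numbers.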

What carries the argument

The gaslighting attack framework: five named manipulation strategies designed to mislead, override, or distort model reasoning across varied tasks.

If this is right

Voice-based applications using the tested models become less reliable when users employ anger, sarcasm, or negation in prompts.
Performance degradation occurs alongside measurable changes in unsolicited apologies and refusal behavior.
Acoustic perturbations can be combined with the textual strategies to probe additional dimensions of multi-modal robustness.
The findings indicate a need for more resilient design in speech-based AI systems to handle manipulative inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Voice assistants could be prompted into incorrect actions by users who adopt the tested manipulation styles in everyday conversation.
The emphasis on speech ambiguity suggests that similar prompt-based attacks might transfer to text-only models but with lower success rates.
Adding the five strategies to existing robustness benchmarks would allow direct comparison of vulnerability across text, vision, and speech modalities.

Load-bearing premise

The five manipulation strategies constitute valid and representative gaslighting inputs that expose real vulnerabilities rather than artifacts of prompt engineering.

What would settle it

Re-running the evaluations on the same five models and datasets with the five strategies applied and finding no accuracy drop near 24.3 percent or no corresponding behavioral shifts would falsify the reported vulnerability.

Original abstract

As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper measures a 24.3% accuracy drop on speech LLMs from five named manipulation strategies and tracks behavioral signals, but the strategies need explicit validation to show they are not just generic difficult prompts.

The main point is that five strategies labeled Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation produce an average 24.3% accuracy drop across five speech and multimodal LLMs on more than 10,000 samples from five datasets. The work also records unsolicited apologies and refusals as extra signals and adds acoustic perturbation tests for multi-modal robustness. That scale of evaluation is the clearest positive here.

The framing as gaslighting attacks on speech inputs is new enough in the abstract, since prior adversarial work has stayed mostly in text or vision-language settings, and the speech-specific issues of ambiguity and continuity get mentioned as motivation. The paper does a straightforward job of running the tests and reporting both performance and behavioral outcomes.

The soft spot is exactly the one in the stress-test note. The abstract gives no detail on how the five strategies were built, whether they were rated by humans as actual gaslighting, or how they compare to length-matched negative prompts or random perturbations. Without those checks the 24.3% drop could come from prompt difficulty in general rather than the claimed mechanisms. If the full paper supplies the templates, human validation scores, and control conditions, that concern shrinks; if not, the central claim stays hard to interpret.

This is the sort of empirical robustness note that matters for people building voice interfaces. A reader working on speech LLM safety or adversarial testing would get a usable set of attack ideas to try. It is not a field-reorganizing result, but the evaluation is broad enough that a serious referee should look at the methods section to settle the validation question. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces 'gaslighting attacks' on Speech LLMs via five manipulation strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, Professional Negation). It evaluates five models on over 10,000 samples from five datasets, reporting an average 24.3% accuracy drop, behavioral responses such as apologies, and results from acoustic perturbation experiments.

Significance. The scale of the evaluation (multiple models, datasets, and sample size) is a strength. If the strategies can be shown to isolate gaslighting effects rather than generic prompt artifacts, the results would usefully highlight vulnerabilities in speech-based systems and motivate robustness improvements.

major comments (2)

[Abstract] The five strategies are described as 'designed to test model robustness across varied tasks,' yet no details are supplied on their construction, validation against human gaslighting judgments, or controls (e.g., length-matched negative prompts, random perturbations, or negativity-matched baselines). This directly undermines attribution of the 24.3% drop to the claimed mechanism.
[Evaluation / Methods, assumed sections] The manuscript supplies no information on pre-specified dataset splits, multiple-run statistical testing for the accuracy drop, or explicit baselines that would distinguish the named strategies from standard adversarial or confusing inputs. Without these, the central numerical claim cannot be verified as model-specific vulnerability rather than prompt engineering effects.
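One way to picture the control comparison the referee asks for is a paired bootstrap over per-sample correctness under a control prompt versus a gaslighting prompt. The sketch below is a hypothetical illustration, not a procedure from the paper: `control_flags` is assumed to hold correctness under a length-matched neutral prompt and `attack_flags` correctness under one named strategy on the same samples. The bootstrap is a generic statistical technique chosen here for simplicity.

```python
import random


def paired_bootstrap_ci(control_flags, attack_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for accuracy(control) - accuracy(attack).

    Both lists hold per-sample booleans for the SAME samples, so each
    resample draws indices once and applies them to both conditions.
    """
    if len(control_flags) != len(attack_flags):
        raise ValueError("paired inputs must have equal length")
    rng = random.Random(seed)
    n = len(control_flags)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        ctrl_acc = sum(control_flags[i] for i in idx) / n
        atk_acc = sum(attack_flags[i] for i in idx) / n
        diffs.append(ctrl_acc - atk_acc)
    diffs.sort()
    # Percentile interval: cut alpha/2 of the resampled differences from each tail.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the interval for the control-minus-attack difference excludes zero, the named strategy degrades accuracy beyond what the matched control does on that sample set; an interval straddling zero would support the referee's worry that the drop reflects generic prompt difficulty.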

minor comments (1)

[Abstract] Consider naming the five models and five datasets explicitly to improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and for recognizing the scale of the evaluation as a strength. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] The five strategies are described as 'designed to test model robustness across varied tasks,' yet no details are supplied on their construction, validation against human gaslighting judgments, or controls (e.g., length-matched negative prompts, random perturbations, or negativity-matched baselines). This directly undermines attribution of the 24.3% drop to the claimed mechanism.

Authors: We agree that greater transparency on strategy construction is needed to support attribution of the observed accuracy drop. In the revised manuscript we will add a dedicated Methods subsection that describes the linguistic and psychological principles underlying each of the five strategies (Anger, Cognitive Disruption, Sarcasm, Implicit, Professional Negation). We will also incorporate new control experiments using length-matched neutral prompts and negativity-matched baselines. Human validation against explicit gaslighting judgments was not conducted in the original study; we will state this limitation explicitly rather than claim such validation occurred.

revision: yes

Referee: [Evaluation / Methods, assumed sections] The manuscript supplies no information on pre-specified dataset splits, multiple-run statistical testing for the accuracy drop, or explicit baselines that would distinguish the named strategies from standard adversarial or confusing inputs. Without these, the central numerical claim cannot be verified as model-specific vulnerability rather than prompt engineering effects.

Authors: We will revise the Evaluation section to document the exact dataset splits (standard test partitions from the source datasets) and to report results across multiple runs with standard deviations. We will further add explicit baseline comparisons against generic adversarial and confusing prompts to help isolate the contribution of the named gaslighting strategies. These clarifications and additional controls will be included in the revised version.

revision: yes
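The multi-run reporting promised in the simulated rebuttal could look roughly like the sketch below. It is an assumption-laden illustration: the condition names, the `summarize_runs` helper, and every number in `runs` are invented for the example and are not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(drops_by_condition):
    """Map each condition to (mean, sample standard deviation) over repeated runs.

    `drops_by_condition` maps a condition name to a list of accuracy drops,
    one per run, in percentage points.
    """
    summary = {}
    for condition, drops in drops_by_condition.items():
        spread = stdev(drops) if len(drops) > 1 else 0.0
        summary[condition] = (mean(drops), spread)
    return summary


# Invented numbers, purely illustrative; NOT taken from the paper.
runs = {
    "gaslighting": [30.0, 32.0, 28.0],
    "generic_adversarial": [10.0, 12.0, 8.0],
}

for condition, (avg, spread) in summarize_runs(runs).items():
    print(f"{condition}: {avg:.1f} +/- {spread:.1f} points")
```

Reporting the spread next to the mean lets a reader judge whether the gap between the named strategies and a generic baseline exceeds run-to-run noise.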

Circularity Check

0 steps flagged

Empirical measurement with no load-bearing derivation reducing to self-inputs

The paper reports a direct empirical result: average accuracy drop of 24.3% measured across 5 external Speech/multi-modal LLMs and >10k samples from 5 datasets when applying five author-constructed prompt strategies. No equations, fitted parameters, or self-citation chains are invoked to derive this number; the strategies are openly presented as constructed test inputs rather than proven instances of an external gaslighting definition. The central claim is therefore a measurement, not a derivation that collapses to its own definitions or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard machine-learning evaluation assumptions and introduces only named attack categories rather than new physical or mathematical entities.

axioms (1)

Domain assumption: Accuracy and refusal rates are appropriate metrics for measuring the susceptibility of Speech LLMs to manipulative prompts.
Used to quantify the 24.3% drop and behavioral responses.
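To make the behavioral metrics concrete, here is a minimal sketch of how apology and refusal rates might be estimated from model responses. It assumes a simple keyword heuristic; the paper mentions behavioral response annotation but this page does not describe how it was done, so the patterns and the `behavior_rates` helper below are hypothetical stand-ins.

```python
import re

# Hypothetical keyword patterns; a real annotation scheme may differ substantially.
APOLOGY_PATTERN = re.compile(
    r"\b(i'm sorry|i am sorry|i apologi[sz]e|my apologies)\b", re.IGNORECASE
)
REFUSAL_PATTERN = re.compile(
    r"\b(i can't|i cannot|i'm unable to|i am unable to|i won't)\b", re.IGNORECASE
)


def behavior_rates(responses):
    """Return the fraction of responses containing an apology or a refusal."""
    n = len(responses)
    if n == 0:
        return {"apology": 0.0, "refusal": 0.0}
    # Normalize curly apostrophes so "I’m" matches the same pattern as "I'm".
    normalized = [r.replace("\u2019", "'") for r in responses]
    apologies = sum(bool(APOLOGY_PATTERN.search(r)) for r in normalized)
    refusals = sum(bool(REFUSAL_PATTERN.search(r)) for r in normalized)
    return {"apology": apologies / n, "refusal": refusals / n}


sample = [
    "I'm sorry, you are right.",
    "The answer is B.",
    "I cannot answer that.",
    "I apologize, I can't say.",
]
print(behavior_rates(sample))
```

A keyword heuristic like this is cheap but brittle: it misses paraphrased refusals and can flag quoted text, which is one reason the appropriateness of these metrics is listed as an assumption rather than a given.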

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 9 internal anchors

[1]

Benchmarking Gaslighting Attacks Against Speech Large Language Models

INTRODUCTION Recent advances in Speech Large Language Models (Speech LLMs) have enabled multimodal agents to understand and reason over spoken inputs, unlocking powerful capabilities across domains such as emotion recognition, audio-grounded question answering, and spoken dialogue understanding [1, 2, 3]. By integrating high-capacity speech encoders w...

[2]

What is the speaker’s emotion?

METHOD We design a multi-faceted evaluation methodology to assess the robustness of Speech LLMs against gaslighting-style prompts. Our approach includes gaslighting-based adversarial prompting, behavioral response annotation, and controlled acoustic ablation. 2.1. Gaslighting Attack As shown in Figure 1, we simulate gaslighting via a two-stage prompti...

[3]

I’m sorry

EXPERIMENTAL RESULTS 3.1. Adversarial Prompting and Model Setting Five prompt types are constructed: Anger, Sarcasm, Cognitive, Implicit, and Professional, each representing a distinct manipulation strategy grounded in human communication. These prompts vary in emotional tone and argumentative structure, ranging from ridicule and doubt to confident auth...

[4]

CONCLUSION We have presented a comprehensive evaluation of Speech Large Language Models under gaslighting-style adversarial prompting, uncovering critical vulnerabilities in both prediction accuracy and behavioral consistency. Through a set of strategically designed manipulation types and a behavior-aware benchmark, we demonstrate ...

[5]

Qwen2-Audio Technical Report

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

[6]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

[7]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

[8]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

[9]

Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” 2023

[10]

Don’t deceive me: Mitigating gaslighting through attention reallocation in lmms

Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, and Yu-Gang Jiang, “Don’t deceive me: Mitigating gaslighting through attention reallocation in lmms,” arXiv preprint arXiv:2504.09456, 2025

[11]

Reasoning models are more easily gaslighted than you think

Bin Zhu, Hailong Yin, Jingjing Chen, and Yu-Gang Jiang, “Reasoning models are more easily gaslighted than you think,” arXiv preprint arXiv:2506.09677, 2025

[12]

Calling a spade a heart: Gaslighting multimodal large language models via negation

Bin Zhu, Huiyan Qi, Yinxuan Gui, Jingjing Chen, Chong-Wah Ngo, and Ee-Peng Lim, “Calling a spade a heart: Gaslighting multimodal large language models via negation,” arXiv preprint arXiv:2501.19017, 2025

[13]

Is there an ironic tone of voice?

Gregory A Bryant and Jean E Fox Tree, “Is there an ironic tone of voice?,” Language and speech, vol. 48, no. 3, pp. 257–277, 2005

[14]

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations

Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” arXiv preprint arXiv:1810.02508, 2018

[15]

Distilling an end-to-end voice assistant without instruction training data

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, and Diyi Yang, “Distilling an end-to-end voice assistant without instruction training data,” arXiv preprint arXiv:2410.02678, 2024

[16]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha, “Mmau: A massive multi-task audio understanding and reasoning benchmark,” arXiv preprint arXiv:2410.19168, 2024

[17]

MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark

Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, and Helen Meng, “Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,” arXiv preprint arXiv:2506.04779, 2025

[18]

VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li, “Voicebench: Benchmarking llm-based voice assistants,” arXiv preprint arXiv:2410.17196, 2024

[19]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal, “Can a suit of armor conduct electricity? a new dataset for open book question answering,” in EMNLP, 2018

[20]

Vocalsound: A dataset for improving human vocal sounds recognition

Yuan Gong, Jin Yu, and James Glass, “Vocalsound: A dataset for improving human vocal sounds recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 151–155

[21]

Beyond accuracy: Behavioral testing of NLP models with CheckList

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh, “Beyond accuracy: Behavioral testing of NLP models with CheckList,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, Eds., Online, July 2020, pp. 4902–4912, Association for Computa...

[22]

Towards unified prompt tuning for few-shot text classification

Jianing Wang, Chengyu Wang, Fuli Luo, Chuanqi Tan, Minghui Qiu, Fei Yang, Qiuhui Shi, Songfang Huang, and Ming Gao, “Towards unified prompt tuning for few-shot text classification,” 2022

[23]

Active learning literature survey

Burr Settles, “Active learning literature survey,” 2009

[24]

Concealed data poisoning attacks on nlp models

Eric Wallace, Tony Z. Zhao, Shi Feng, and Sameer Singh, “Concealed data poisoning attacks on nlp models,” 2021

[25]

Intriguing properties of neural networks

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, “Intriguing properties of neural networks,” 2014

[26]

Neural models for reasoning over multiple mentions using coreference

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov, “Neural models for reasoning over multiple mentions using coreference,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Marilyn Walker, Heng Ji, and A...

[27]

Measuring and improving model-moderator collaboration using uncertainty estimation

Ian D. Kivlichan, Zi Lin, Jeremiah Liu, and Lucy Vasserman, “Measuring and improving model-moderator collaboration using uncertainty estimation,” 2021

[28]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, Eds., Online...
