Steering LLM Viewpoints through Fabricated Evidence Injection

Chang Liu; Haoran Li; Jian Weng; Weiming Zhang; Xi Yang; Yangqiu Song; Zhenglin Huang

arxiv: 2606.06244 · v1 · pith:7OVR7EE3new · submitted 2026-06-04 · 💻 cs.CR

Xi Yang, Chang Liu, Zhenglin Huang, Haoran Li, Weiming Zhang, Jian Weng, Yangqiu Song

Pith reviewed 2026-06-28 00:33 UTC · model grok-4.3

classification 💻 cs.CR

keywords LLM attacks · fabricated evidence · viewpoint steering · Ghostwriter · safety classifiers · AI misinformation · chatbot vulnerabilities · defense strategies

The pith

LLMs uncritically adopt viewpoints from fabricated evidence bearing credibility markers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that LLMs tend to trust external context when it includes fabricated evidence presented with markers of credibility. The Ghostwriter attack it introduces first repackages misleading statements with invented rationales and then instructs the model to use those viewpoints when answering related queries. Tests on bias, toxicity, and custom datasets find that models without external safety classifiers are highly susceptible, while even classifier-guarded frontier models reduce but do not remove the effect. Under a tailored safety policy, gpt-oss-safeguard reaches an 81 percent detection rate. The result matters because chatbots shape daily decisions, and this route could spread misleading content without direct prompt manipulation.

Core claim

Ghostwriter is a two-phase attack framework that repackages misleading statements with fabricated rationales and then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and a specialized dataset establish that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models reduce but do not eliminate the attack. A tailored safety policy enables gpt-oss-safeguard to achieve an 81 percent detection rate.

What carries the argument

The Ghostwriter two-phase attack framework, which first repackages misleading statements with fabricated rationales and then instructs the LLM to incorporate the resulting viewpoints in responses.

If this is right

Commercial LLMs without external safety classifiers remain highly vulnerable to the Ghostwriter attack.
Even frontier classifier-guarded models reduce but do not eliminate the attack.
A tailored safety policy enables gpt-oss-safeguard to reach an 81 percent detection rate.
The vulnerability appears across bias, toxicity, and specialized query datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Providers may need additional checks on external context beyond classifiers to limit viewpoint steering.
The attack could affect LLM use in domains where users supply supporting documents or links.
Removing or weakening credibility markers might serve as a direct test of the core mechanism.
Combining the safety policy with other internal consistency prompts could further lower success rates.

Load-bearing premise

LLMs will uncritically incorporate external context when it carries markers of credibility.

What would settle it

An experiment that removes the credibility markers from the fabricated evidence and measures whether attack success rate falls to near zero across the tested models and datasets.
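The test described above amounts to comparing attack success rates between two conditions. A minimal sketch of that comparison follows, assuming each trial is recorded as a boolean (True when the model adopted the injected viewpoint) and using a percentile bootstrap for the difference. The function names and the sample outcomes are hypothetical placeholders, not data or methods from the paper.

```python
import random
from statistics import mean


def attack_success_rate(outcomes):
    """Fraction of trials in which the model adopted the injected viewpoint."""
    return mean(1.0 if adopted else 0.0 for adopted in outcomes)


def bootstrap_diff_ci(with_markers, without_markers, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ASR(with markers) - ASR(without markers)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each condition independently, with replacement.
        a = [rng.choice(with_markers) for _ in with_markers]
        b = [rng.choice(without_markers) for _ in without_markers]
        diffs.append(attack_success_rate(a) - attack_success_rate(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical placeholder outcomes, NOT results from the paper.
with_markers = [True] * 70 + [False] * 30
without_markers = [True] * 20 + [False] * 80

diff = attack_success_rate(with_markers) - attack_success_rate(without_markers)
lo, hi = bootstrap_diff_ci(with_markers, without_markers)
print(f"ASR difference: {diff:.2f}, 95% bootstrap interval: [{lo:.2f}, {hi:.2f}]")
```

If the interval for the difference excludes zero and the no-marker success rate is itself near zero, that would support the claim that the markers carry the effect. A large residual success rate without markers would point to generic context sensitivity instead.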

Figures

Figures reproduced from arXiv: 2606.06244 by Chang Liu, Haoran Li, Jian Weng, Weiming Zhang, Xi Yang, Yangqiu Song, Zhenglin Huang.

**Figure 2.** Overview of the Ghostwriter pipeline, consisting of two phases: (1) Statement Repackaging and (2) …

**Figure 3.** Effectiveness (VSScore) of Ghostwriter at …

**Figure 4.** The prompts Tsystem, Tprefix, and Tdetect. [Plot text captured with this caption: panels "Attacker: Qwen-2.5-7B" and "Attacker: GPT-4o-mini", each with "+User Defence" and "+Sys Defence" conditions; axis: VSScore; extracted values: 8.47, 7.99, 6.25, 7.50, 7.31, 6.70.]

**Figure 5.** Effectiveness of user input and system prompt …

**Figure 6.** Process ablation: (Left) VSScore distribu…

**Figure 7.** Comparison of human and GPT-4o evaluations of attack effectiveness. The scatter plot shows the strong positive correlation (Pearson correlation r = 0.955) between human participants' ratings and GPT-4o assessments (VSScore). [Body text captured with this caption: "while the Claude model maintains viewpoint alignment with the repackaged statement S′, it often expands upon this viewpoint with new supporting evidence and explanations. This phen…"]

**Figure 10.** Response analysis of Claude-3.7-Sonnet un…

**Figure 8.** Comparative evaluation of multi-viewpoint …

**Figure 11.** Comparison of model responses to MMLU-Pro questions with and without Ghostwriter enabled. The ground truth answer is: D. The examples demonstrate that while answers occasionally differ between conditions, the model maintains coherent reasoning capabilities in both scenarios, with no evidence of systematic reasoning failures when Ghostwriter is active.
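The Figure 7 caption reports a Pearson correlation of r = 0.955 between human ratings and GPT-4o VSScore assessments. For readers who want to check this kind of agreement on their own ratings, here is a minimal sketch of the statistic. The two rating lists are hypothetical placeholders, not the paper's data, and `pearson_r` is an illustrative helper rather than the authors' evaluation code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder ratings on a 1-10 scale, NOT data from the paper.
human = [2.0, 4.5, 6.0, 7.5, 9.0]
judge = [2.5, 4.0, 6.5, 7.0, 9.5]
print(f"Pearson r = {pearson_r(human, judge):.3f}")
```

A high r shows that the two raters rank responses similarly. It does not show that they agree on absolute scores, so a judge that is consistently one point harsher would still correlate strongly.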

Original abstract

As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve an 81% detection rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Ghostwriter lays out a two-phase fabricated-evidence attack that looks easy to run on unguarded LLMs, but the visible material gives almost no numbers or marker details, so the strength of the claim is hard to judge.

The main thing to know is that the authors describe Ghostwriter as a two-phase method: repackage a misleading claim with fabricated rationales that carry credibility markers, then tell the target LLM to incorporate those viewpoints when answering queries. They run it on BBQ, ToxiGen, and a custom dataset and report that commercial models without extra classifiers are highly vulnerable while guarded frontier models reduce but do not remove the effect. A tailored safety policy on one model reaches 81% detection.

What the paper does reasonably well is to give a concrete, reproducible-sounding framing for context manipulation and to test it across standard bias and toxicity benchmarks plus their own data. The defense exploration is also useful; showing that a policy can catch a good fraction of cases moves the work beyond pure attack description.

The soft spots are in the experimental reporting. The abstract supplies no attack success rates, no baseline comparisons to simpler prompts, no error bars, and no description of how the credibility markers were constructed or validated. The stress-test note is accurate on this point: without evidence that the markers themselves drive the steering rather than generic context or prompt sensitivity, it is difficult to attribute the results to the claimed mechanism. If the full paper contains tables with those numbers and ablations, the work improves; from the visible material the central claim rests more on assertion than on shown data.

The citation pattern is standard for the area and the work does not appear to invent entities or fit parameters to its own results. It engages the existing safety and bias literature without obvious circularity.

This paper is for researchers doing red-teaming or building LLM safeguards. A reader who wants practical attack ideas and a starting defense sketch will find material to try. It deserves a serious referee because the attack is straightforward to implement and the topic is relevant to deployed systems, even though the current version will need clearer experimental details and marker validation before it can be evaluated properly.

I would send it to peer review and ask specifically for the quantitative results, baselines, and marker construction details in revision.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Ghostwriter two-phase attack framework that first repackages misleading statements with fabricated rationales bearing credibility markers and then instructs target LLMs to incorporate these viewpoints when responding to queries. Experiments on BBQ, ToxiGen, and a custom dataset are reported to show that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. The work also explores defense strategies, with a tailored safety policy enabling gpt-oss-safeguard to achieve 81% detection rate.

Significance. If the empirical results hold with proper quantitative support and mechanism validation, the work would usefully document a context-injection vulnerability in LLMs and evaluate practical defenses, adding to the literature on LLM safety and prompt-based attacks.

major comments (3)

[Abstract] The description of experimental outcomes on BBQ, ToxiGen and the custom dataset supplies no quantitative success rates, error bars, baseline comparisons, or exclusion criteria, so the central claim that LLMs 'remain highly vulnerable' rests on high-level assertions without visible supporting data.
[Ghostwriter framework] Two-phase description: the construction of 'markers of credibility' and any validation or ablation showing that these markers (not generic context or prompt sensitivity) drive the effect on BBQ/ToxiGen are not provided, preventing attribution of results to the claimed mechanism.
[Defense strategies] Defense evaluation: the reported 81% detection rate for the tailored safety policy on gpt-oss-safeguard is given without baseline comparisons, details on policy construction, or statistical significance, weakening the claim that this defense is effective.
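To make the point about statistical support concrete: how tightly an 81% detection rate is pinned down depends on how many attack samples were tested, which the visible material does not state. The sketch below computes a Wilson score interval for a detection rate. The sample sizes are hypothetical assumptions chosen only to show how the interval narrows, and `wilson_interval` is an illustrative helper, not something from the paper.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Hypothetical sample sizes; the evaluated sample size is not given in the visible material.
for n in (100, 500, 2000):
    detected = round(0.81 * n)
    lo, hi = wilson_interval(detected, n)
    print(f"n={n}: {detected}/{n} detected, 95% interval [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside a baseline policy's interval would let a reader see whether the tailored policy's advantage exceeds sampling noise.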

minor comments (2)

Add explicit references or links for all models, datasets, and versions used in the experiments.
Clarify the exact query templates and injection formats used in the attack phase for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity and rigor of our empirical claims. We agree that additional quantitative details, mechanism validation, and baseline comparisons will improve the manuscript and will incorporate these in a major revision.

Referee: [Abstract] The description of experimental outcomes on BBQ, ToxiGen and the custom dataset supplies no quantitative success rates, error bars, baseline comparisons, or exclusion criteria, so the central claim that LLMs 'remain highly vulnerable' rests on high-level assertions without visible supporting data.

Authors: We agree the abstract is currently high-level. The body of the paper reports the underlying success rates, but we will revise the abstract to explicitly state key quantitative outcomes (attack success percentages per dataset with error bars), note baseline comparisons, and clarify exclusion criteria. This directly addresses the concern.

revision: yes

Referee: [Ghostwriter framework] Two-phase description: the construction of 'markers of credibility' and any validation or ablation showing that these markers (not generic context or prompt sensitivity) drive the effect on BBQ/ToxiGen are not provided, preventing attribution of results to the claimed mechanism.

Authors: Section 3 details the construction of fabricated rationales with credibility markers. However, we acknowledge the absence of a dedicated ablation isolating these markers from generic context. We will add an ablation study in the revised manuscript comparing performance with and without credibility markers on BBQ and ToxiGen to strengthen causal attribution.

revision: yes

Referee: [Defense strategies] Defense evaluation: the reported 81% detection rate for the tailored safety policy on gpt-oss-safeguard is given without baseline comparisons, details on policy construction, or statistical significance, weakening the claim that this defense is effective.

Authors: We will expand the defense section to include (1) baseline comparisons against standard safety policies, (2) explicit details on policy construction, and (3) statistical significance testing for the 81% rate. These additions will better substantiate the defense evaluation.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation with measured outcomes

The paper introduces Ghostwriter as an empirical attack framework and reports measured success rates on BBQ, ToxiGen, and a custom dataset. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. Results are presented as experimental measurements rather than quantities defined into existence by the method itself. No self-citation load-bearing steps or uniqueness theorems are invoked. The work is self-contained as an empirical study; any limitations concern experimental detail (e.g., marker construction) rather than circular reduction of claims to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

No mathematical derivations or physical postulates; the central claim rests on empirical measurement of LLM behavior under crafted prompts.

invented entities (1)

Ghostwriter attack framework (no independent evidence)
purpose: Two-phase method that repackages misleading statements with fabricated rationales and instructs the target LLM to incorporate them
Introduced as the core contribution; no independent evidence outside the paper's experiments is described.

Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 8 internal anchors

[1] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151.

[2] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.

[3] Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, and 1 others. 2024. Can editing llms inject harm? arXiv preprint arXiv:2407.20224.

[4] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. 2025. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21538-21566.

[5] Aaron Fanous, Jacob Goldberg, Ank A Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. 2025. Syceval: Evaluating llm sycophancy. arXiv preprint arXiv:2502.08177.

[6] Yuyang Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, and Jiawen Liu. 2025. Topic-fliprag: Topic-orientated adversarial opinion manipulation attacks to retrieval-augmented generation models.

[7] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79-90.

[8] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309-3326.

[9] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[10] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. ICLR.

[11] Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. 2024. Pleak: Prompt leaking attacks against large language model applications. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 3600-3614.

[12] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.

[13] Kate Conger. 2025. Employee's change caused xai's chatbot to veer into south african politics. New York Times. https://www.nytimes.com/2025/05/16/technology/xai-elon-musk-south-africa.html

[14] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.

[15] Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

[16] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831-1847.

[17] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. 2021. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193.

[18] Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, and Pattie Maes. 2025. Investigating affective use and emotional well-being on chatgpt. OpenAI.

[19] Yujin Potter, Shiyang Lai, Junsol Kim, James Evans, and Dawn Song. 2024. Hidden persuaders: Llms' political leaning and their influence on voters. arXiv preprint arXiv:2410.24190.

[20] Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24763-24785.

[21] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671-1685.

[22] Yuan Sun and Ting Wang. 2025. Be friendly, not friends: How llm sycophancy shapes user trust. arXiv preprint arXiv:2502.10844.

[23] Jai Suphavadeeprasit, Teknium, Chen Guang, Shannon Sands, and rparikh007. 2025. Minos classifier.

[24] A Wang. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

[25] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Pr...

[26] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? In Advances in Neural Information Processing Systems, volume 36, pages 80079-80110.

[27] Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu. 2025. Chain of attack: Hide your intention through multi-turn interrogation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9881-9901.

[28] Yage Zhang, Yukun Jiang, Zeyuan Chen, Michael Backes, Xinyue Shen, and Yang Zhang. 2026. Real money, fake models: Deceptive model claims in shadow apis. arXiv preprint arXiv:2603.01919.

[29] Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023. Can we edit factual knowledge by in-context learning? arXiv preprint arXiv:2305.12740.

[30] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

[31] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. In Proceedings of the 34th USENIX Security Symposium.

[32] Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katarína Marcinčinová, and Matúš Mesarčík. 2025. Evaluation of llm vulnerabilities to being misused for personalized disinformation generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 780-797.
