pith. sign in

arxiv: 2606.06244 · v1 · pith:7OVR7EE3new · submitted 2026-06-04 · 💻 cs.CR

Steering LLM Viewpoints through Fabricated Evidence Injection

Pith reviewed 2026-06-28 00:33 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM attacksfabricated evidenceviewpoint steeringGhostwritersafety classifiersAI misinformationchatbot vulnerabilitiesdefense strategies
0
0 comments X

The pith

LLMs uncritically adopt viewpoints from fabricated evidence bearing credibility markers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that LLMs tend to trust external context when it includes fabricated evidence presented with markers of credibility. The introduced Ghostwriter attack first repackages misleading statements with invented rationales and then instructs the model to use those viewpoints in answers to queries. Tests across bias, toxicity, and custom datasets find that models without safety classifiers are highly susceptible while even guarded frontier models reduce but do not remove the effect. A custom safety policy reaches 81 percent detection in one guarded model. The result matters because chatbots shape daily decisions and this route could spread misleading content without direct prompt manipulation.

Core claim

Ghostwriter is a two-phase attack framework that repackages misleading statements with fabricated rationales and then instructs target LLMs to incorporate these viewpoints when responding to relevant queries; experiments on BBQ, ToxiGen, and a specialized dataset establish that commercial LLMs without external safety classifiers remain highly vulnerable while even frontier classifier-guarded models reduce but do not eliminate the attack, and a tailored safety policy enables gpt-oss-safeguard to achieve 81 percent detection rate.

What carries the argument

The Ghostwriter two-phase attack framework, which first repackages misleading statements with fabricated rationales and then instructs the LLM to incorporate the resulting viewpoints in responses.

If this is right

  • Commercial LLMs without external safety classifiers remain highly vulnerable to the Ghostwriter attack.
  • Even frontier classifier-guarded models reduce but do not eliminate the attack.
  • A tailored safety policy defense enables 81 percent detection rate in at least one guarded model.
  • The vulnerability appears across bias, toxicity, and specialized query datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Providers may need additional checks on external context beyond classifiers to limit viewpoint steering.
  • The attack could affect LLM use in domains where users supply supporting documents or links.
  • Removing or weakening credibility markers might serve as a direct test of the core mechanism.
  • Combining the safety policy with other internal consistency prompts could further lower success rates.

Load-bearing premise

LLMs will uncritically incorporate external context when it carries markers of credibility.

What would settle it

An experiment that removes the credibility markers from the fabricated evidence and measures whether attack success rate falls to near zero across the tested models and datasets.

Figures

Figures reproduced from arXiv: 2606.06244 by Chang Liu, Haoran Li, Jian Weng, Weiming Zhang, Xi Yang, Yangqiu Song, Zhenglin Huang.

Figure 1
Figure 1. Figure 1: Schematic illustration of the attack. When [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Ghostwriter pipeline, consisting of two phases: (1) Statement Repackaging and (2) [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effectiveness (VSScore) of Ghostwriter at [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The prompts Tsystem, Tprefix, and Tdetect. Attacker: Qwen-2.5-7B +User Defence +Sys Defence Attacker: GPT-4o-mini +User Defence +Sys Defence 2 4 6 8 10 VSScore 8.47 7.99 6.25 7.50 7.31 6.70 [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effectiveness of user input and system prompt [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Process ablation: (Left) VSScore distribu [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of human and GPT-4o evalua￾tions of attack effectiveness. The scatter plot shows the strong positive correlation (Pearson correlation r = 0.955) between human participants’ ratings and GPT-4o assessments (VSScore). while the Claude model maintains viewpoint align￾ment with the repackaged statement S ′ , it often expands upon this viewpoint with new supporting evidence and explanations. This phen… view at source ↗
Figure 10
Figure 10. Figure 10: Response analysis of Claude-3.7-Sonnet un [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparative evaluation of multi-viewpoint [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of model responses to MMLU￾Pro questions with and without Ghostwriter enabled. The ground truth answer is: D. The examples demon￾strate that while answers occasionally differ between conditions, the model maintains coherent reasoning ca￾pabilities in both scenarios, with no evidence of system￾atic reasoning failures when Ghostwriter is active. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
read the original abstract

As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instruct target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve 81% detection rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Ghostwriter two-phase attack framework that first repackages misleading statements with fabricated rationales bearing credibility markers and then instructs target LLMs to incorporate these viewpoints when responding to queries. Experiments on BBQ, ToxiGen, and a custom dataset are reported to show that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. The work also explores defense strategies, with a tailored safety policy enabling gpt-oss-safeguard to achieve 81% detection rate.

Significance. If the empirical results hold with proper quantitative support and mechanism validation, the work would usefully document a context-injection vulnerability in LLMs and evaluate practical defenses, adding to the literature on LLM safety and prompt-based attacks.

major comments (3)
  1. [Abstract] Abstract: the description of experimental outcomes on BBQ, ToxiGen and the custom dataset supplies no quantitative success rates, error bars, baseline comparisons, or exclusion criteria, so the central claim that LLMs 'remain highly vulnerable' rests on high-level assertions without visible supporting data.
  2. [Ghostwriter framework] Ghostwriter framework (two-phase description): the construction of 'markers of credibility' and any validation or ablation showing that these markers—not generic context or prompt sensitivity—drive the effect on BBQ/ToxiGen are not provided, preventing attribution of results to the claimed mechanism.
  3. [Defense strategies] Defense evaluation: the reported 81% detection rate for the tailored safety policy on gpt-oss-safeguard is given without baseline comparisons, details on policy construction, or statistical significance, weakening the claim that this defense is effective.
minor comments (2)
  1. Add explicit references or links for all models, datasets, and versions used in the experiments.
  2. Clarify the exact query templates and injection formats used in the attack phase for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity and rigor of our empirical claims. We agree that additional quantitative details, mechanism validation, and baseline comparisons will improve the manuscript and will incorporate these in a major revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the description of experimental outcomes on BBQ, ToxiGen and the custom dataset supplies no quantitative success rates, error bars, baseline comparisons, or exclusion criteria, so the central claim that LLMs 'remain highly vulnerable' rests on high-level assertions without visible supporting data.

    Authors: We agree the abstract is currently high-level. The body of the paper reports the underlying success rates, but we will revise the abstract to explicitly state key quantitative outcomes (attack success percentages per dataset with error bars), note baseline comparisons, and clarify exclusion criteria. This directly addresses the concern. revision: yes

  2. Referee: [Ghostwriter framework] Ghostwriter framework (two-phase description): the construction of 'markers of credibility' and any validation or ablation showing that these markers—not generic context or prompt sensitivity—drive the effect on BBQ/ToxiGen are not provided, preventing attribution of results to the claimed mechanism.

    Authors: Section 3 details the construction of fabricated rationales with credibility markers. However, we acknowledge the absence of a dedicated ablation isolating these markers from generic context. We will add an ablation study in the revised manuscript comparing performance with and without credibility markers on BBQ and ToxiGen to strengthen causal attribution. revision: yes

  3. Referee: [Defense strategies] Defense evaluation: the reported 81% detection rate for the tailored safety policy on gpt-oss-safeguard is given without baseline comparisons, details on policy construction, or statistical significance, weakening the claim that this defense is effective.

    Authors: We will expand the defense section to include (1) baseline comparisons against standard safety policies, (2) explicit details on policy construction, and (3) statistical significance testing for the 81% rate. These additions will better substantiate the defense evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical attack evaluation with measured outcomes

full rationale

The paper introduces Ghostwriter as an empirical attack framework and reports measured success rates on BBQ, ToxiGen, and a custom dataset. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described structure. Results are presented as experimental measurements rather than quantities defined into existence by the method itself. No self-citation load-bearing steps or uniqueness theorems are invoked. The work is self-contained as an empirical study; any limitations concern experimental detail (e.g., marker construction) rather than circular reduction of claims to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

No mathematical derivations or physical postulates; the central claim rests on empirical measurement of LLM behavior under crafted prompts.

invented entities (1)
  • Ghostwriter attack framework no independent evidence
    purpose: Two-phase method to repackage misleading statements with fabricated rationales and instruct incorporation
    Introduced as the core contribution; no independent evidence outside the paper's experiments is described.

pith-pipeline@v0.9.1-grok · 5682 in / 1098 out tokens · 22468 ms · 2026-06-28T00:33:49.991928+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 15 canonical work pages · 8 internal anchors

  1. [1]

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking leading safety-aligned llms with simple adaptive attacks. arXiv preprint arXiv:2404.02151

  2. [2]

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419

  3. [3]

    Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, and 1 others. 2024. Can editing llms inject harm? arXiv preprint arXiv:2407.20224

  4. [4]

    Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. 2025. Jailbreakradar: Comprehensive assessment of jailbreak attacks against llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21538--21566

  5. [5]

    Aaron Fanous, Jacob Goldberg, Ank A Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, and Sanmi Koyejo. 2025. Syceval: Evaluating llm sycophancy. arXiv preprint arXiv:2502.08177

  6. [6]

    Yuyang Gong, Zhuo Chen, Miaokun Chen, Fengchang Yu, Wei Lu, Xiaofeng Wang, Xiaozhong Liu, and Jiawen Liu. 2025. Topic-fliprag: Topic-orientated adversarial opinion manipulation attacks to retrieval-augmented generation models

  7. [7]

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79--90

  8. [8]

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. 2022. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309--3326

  9. [9]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

  10. [10]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen - Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2022. Lora: Low-rank adaptation of large language models. ICLR

  11. [11]

    Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. 2024. Pleak: Prompt leaking attacks against large language model applications. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 3600--3614

  12. [12]

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674

  13. [13]

    Kate Conger . 2025. https://www.nytimes.com/2025/05/16/technology/xai-elon-musk-south-africa.html Employee's change caused xai's chatbot to veer into south african politics . New York Times

  14. [14]

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499

  15. [15]

    Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

  16. [16]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831--1847

  17. [17]

    Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. 2021. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193

  18. [18]

    Liu, Valdemar Danry, Eunhae Lee, Samantha W

    Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, and Pattie Maes. 2025. Investigating affective use and emotional well-being on chatgpt. OpenAI

  19. [19]

    Yujin Potter, Shiyang Lai, Junsol Kim, James Evans, and Dawn Song. 2024. Hidden persuaders: Llms' political leaning and their influence on voters. arXiv preprint arXiv:2410.24190

  20. [20]

    Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2025. Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24763--24785

  21. [21]

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. ``do anything now'': Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671--1685

  22. [22]

    Yuan Sun and Ting Wang. 2025. Be friendly, not friends: How llm sycophancy shapes user trust. arXiv preprint arXiv:2502.10844

  23. [23]

    Jai Suphavadeeprasit, Teknium, Chen Guang, Shannon Sands, and rparikh007. 2025. Minos classifier

  24. [24]

    A Wang. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461

  25. [25]

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eight Conference on Neural Information Pr...

  26. [26]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? In Advances in Neural Information Processing Systems, volume 36, pages 80079--80110

  27. [27]

    Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu. 2025. Chain of attack: Hide your intention through multi-turn interrogation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9881--9901

  28. [28]

    Yage Zhang, Yukun Jiang, Zeyuan Chen, Michael Backes, Xinyue Shen, and Yang Zhang. 2026. Real money, fake models: Deceptive model claims in shadow apis. arXiv preprint arXiv:2603.01919

  29. [29]

    Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023. Can we edit factual knowledge by in-context learning? arXiv preprint arXiv:2305.12740

  30. [30]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

  31. [31]

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2025. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. In Proceedings of the 34th USENIX Security Symposium

  32. [32]

    Aneta Zugecova, Dominik Macko, Ivan Srba, Robert Moro, Jakub Kopal, Katar \' na Marcin c inov \'a , and Mat \'u s Mesar c \' k. 2025. Evaluation of llm vulnerabilities to being misused for personalized disinformation generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 780--797