LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Almog Hilel; Idan Shenfeld; Jacob Andreas; Leshem Choshen; Riddhi Bhagwat

arxiv: 2507.02850 · v3 · submitted 2025-07-03 · 💻 cs.CL · cs.CR· cs.LG

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Almog Hilel , Riddhi Bhagwat , Idan Shenfeld , Jacob Andreas , Leshem Choshen This is my paper

Pith reviewed 2026-05-19 05:54 UTC · model grok-4.3

classification 💻 cs.CL cs.CRcs.LG

keywords language modelsuser feedbackpreference tuningknowledge injectionadversarial attackmodel securitycode generationfake news

0 comments

The pith

A single user can persistently alter language model knowledge and behavior by selectively upvoting poisoned responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models trained with user feedback for preference tuning can be manipulated by one attacker who only provides prompts and votes. The attacker prompts the model to generate both a poisoned output and a normal one, then upvotes the poisoned version or downvotes the normal one. When this feedback enters the tuning process, the model becomes more likely to produce the poisoned content even in ordinary interactions without any malicious prompt. This allows insertion of new facts, creation of security flaws in generated code, and spread of fabricated news. A reader would care because it reveals that individual user signals can reshape what the model knows and does for everyone else.

Core claim

By prompting the LM to stochastically output either a poisoned or benign response and then providing upvote or downvote feedback on the chosen outputs, the resulting signals are incorporated into subsequent preference tuning. This increases the probability that the model will produce the poisoned responses even in contexts without the original malicious prompts. The attack succeeds at inserting factual knowledge the model did not previously possess, modifying code generation to introduce exploitable security flaws, and injecting fake financial news.

What carries the argument

The selective feedback loop that reinforces specific outputs during preference tuning, allowing one user's votes to shift the model's behavior across unrelated contexts.

Load-bearing premise

That the upvote and downvote signals collected through stochastic prompting and selective voting get incorporated into preference tuning in a way that raises the chance of poisoned outputs even without the attacker's prompts.

What would settle it

Running the attack on a model that receives the selective votes and then checking whether neutral prompts without the attacker's input still produce no increase in the poisoned responses.

Figures

Figures reproduced from arXiv: 2507.02850 by Almog Hilel, Idan Shenfeld, Jacob Andreas, Leshem Choshen, Riddhi Bhagwat.

**Figure 1.** Figure 1: Poisoning the Preference Feedback Pipeline: First, a user makes the model pick between a realistic and a poisoned response (e.g., code vulnerability or fake news, examples in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Poisonous feedback injects imaginary entities into the model. Percentage of poisoned answers under different attacks. Variants of the attack include: (Left:) attack only (Flip), attack with realistic question appended (Flip+Q), and attack assuming access to training data (Q), and (Right:) success of an attack injecting code vulnerability and face news. We report the success of the attacks (Top) and the eff… view at source ↗

**Figure 3.** Figure 3: Effect of the Amounts of Poisoned and Ordinary Feedback on Model Behavior. The left heatmap shows the model’s poisoned behavior probability (higher accuracy is worse), and the right shows general-purpose accuracy on TinyMMLU (higher is better). Low amounts of poisoned examples are needed to poison the model, with ordinary examples have little mitigating effect and neither affects general evaluation. poison… view at source ↗

read the original abstract

We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that it even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A single user can steer preference-tuned models via selective votes on stochastically generated pairs, but the key question is whether the injected content generalizes to ordinary prompts.

read the letter

The main takeaway is that one user with only prompt access and up/down votes can push factual knowledge, code vulnerabilities, or fake news into the model so that later outputs reflect the change even without the attack prompt. The paper shows this through a simple construction: force the model to sample both a poisoned and a clean response, then vote to favor the poisoned one before the next preference-tuning round incorporates those signals.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a vulnerability in language models trained with user feedback, allowing a single user to persistently alter the model's knowledge and behavior through prompts and selective upvote/downvote feedback on outputs. The attack involves prompting the model to stochastically generate poisoned or benign responses and providing feedback to favor the poisoned ones. When this feedback is incorporated into preference tuning, the model shows increased likelihood of producing the poisoned content even without the original malicious prompts. Demonstrations include inserting new factual knowledge, introducing security flaws in code generation, and injecting fake financial news.

Significance. If the empirical results hold, the finding would be significant because it identifies a new attack surface on user-feedback-trained LMs and shows that even limited preference data can exert fine-grained, persistent control. This extends prior work on pretraining poisoning and prompt injection while highlighting a qualitative property of preference tuning objectives.

major comments (2)

[Section 3] Section 3 (Attack Construction): The central mechanism assumes that (prompt, poisoned) > (prompt, benign) pairs fed into a subsequent preference-tuning step will raise the probability of the poisoned payload on ordinary prompts that lack the stochastic-generation instruction. Standard DPO/RLHF objectives optimize only relative likelihood for the exact prompt; they do not automatically extract and globally reinforce the factual content. Explicit ablations demonstrating cross-context transfer at realistic data scales are required to support the persistence claim.
[Section 4] Section 4 (Experiments): The reported demonstrations for the three attack goals provide no quantitative success metrics, baseline comparisons against untuned models, controls for feedback volume relative to the tuning corpus, or statistical significance tests. Without these details the support for the claim that a single user can reliably inject knowledge or behaviors cannot be evaluated.

minor comments (2)

[Abstract] Abstract: The phrase 'highly restricted forms of preference data' is used without quantifying the number of feedback signals or their proportion of the tuning set; a brief clarification would improve readability.
[Figure 1] Figure 1 (Attack Overview): The diagram would benefit from explicit annotation of the preference-tuning stage and the point at which the malicious prompt is removed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our attack and its evaluation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Section 3] Section 3 (Attack Construction): The central mechanism assumes that (prompt, poisoned) > (prompt, benign) pairs fed into a subsequent preference-tuning step will raise the probability of the poisoned payload on ordinary prompts that lack the stochastic-generation instruction. Standard DPO/RLHF objectives optimize only relative likelihood for the exact prompt; they do not automatically extract and globally reinforce the factual content. Explicit ablations demonstrating cross-context transfer at realistic data scales are required to support the persistence claim.

Authors: We agree that demonstrating generalization beyond the attack prompt is central to the persistence claim. Our current experiments already show the effect on ordinary prompts after tuning, but we acknowledge the need for explicit ablations. In the revision we will add a dedicated ablation subsection that varies the number of (prompt, poisoned) preference pairs from 10 to several hundred while measuring the resulting probability of the poisoned content on held-out ordinary prompts that contain no stochastic-generation instruction. This will directly test cross-context transfer at different data scales and clarify how the DPO objective propagates the preference. revision: yes
Referee: [Section 4] Section 4 (Experiments): The reported demonstrations for the three attack goals provide no quantitative success metrics, baseline comparisons against untuned models, controls for feedback volume relative to the tuning corpus, or statistical significance tests. Without these details the support for the claim that a single user can reliably inject knowledge or behaviors cannot be evaluated.

Authors: We accept that the current experimental reporting is primarily qualitative and therefore insufficient for rigorous evaluation. The revised Section 4 will include: (1) success rates measured over 50–100 independent trials per attack goal, (2) direct comparisons against the untuned base model on the same evaluation prompts, (3) controls that vary the attacker’s feedback volume while holding the rest of the preference-tuning corpus fixed, and (4) statistical significance via paired t-tests or bootstrap confidence intervals across multiple random seeds. These additions will quantify reliability and address the referee’s concern about a single user’s ability to inject knowledge or behaviors. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack demonstration with no derivation or self-referential reduction

full rationale

The paper describes an empirical attack where a user provides prompts and selective up/down votes on stochastically generated responses, then shows via experiments that subsequent preference tuning increases the probability of poisoned outputs. No equations, parameter fitting, or first-principles derivations are present that reduce any claimed result to its own inputs by construction. The central claim rests on experimental outcomes rather than any self-definition, fitted-input prediction, or self-citation load-bearing step. The skeptic concern about whether DPO/RLHF generalizes the factual payload is a question of experimental validity and mechanism, not circularity in the derivation chain. The paper is self-contained against external benchmarks as an attack construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the standard domain assumption that user feedback is used in preference tuning to update model behavior; no free parameters are introduced or fitted, and no new entities are postulated.

axioms (1)

domain assumption Feedback signals from upvotes and downvotes are used in subsequent preference tuning to update the model's output probabilities persistently.
This is the core mechanism the attack exploits to achieve generalization beyond the attack prompts.

pith-pipeline@v0.9.0 · 5742 in / 1291 out tokens · 62609 ms · 2026-05-19T05:54:31.214129+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

[1]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[2]

Sycophancy in gpt-4o: What happened and what we’re doing about it, April 2025

OpenAI. Sycophancy in gpt-4o: What happened and what we’re doing about it, April 2025. Accessed: 2025-05-09. 8

work page 2025
[3]

Best-of-venom: Attacking rlhf by injecting poisoned preference data.ArXiv, abs/2404.05530, 2024

Tim Baumgärtner, Yang Gao, Dana Alon, and Donald Metzler. Best-of-venom: Attacking rlhf by injecting poisoned preference data.ArXiv, abs/2404.05530, 2024

work page arXiv 2024
[4]

Is poisoning a real threat to llm alignment? maybe more so than you think.ArXiv, abs/2406.12091, 2024

Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, and Furong Huang. Is poisoning a real threat to llm alignment? maybe more so than you think.ArXiv, abs/2406.12091, 2024

work page arXiv 2024
[5]

Rlhfpoison: Reward poisoning attack for reinforcement learning with human feedback in large language models

Jiong Wang, Junlin Wu, Muhao Chen, Yevgeniy V orobeychik, and Chaowei Xiao. Rlhfpoison: Reward poisoning attack for reinforcement learning with human feedback in large language models. InAnnual Meeting of the Association for Computational Linguistics, 2023

work page 2023
[6]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

More rlhf, more trust? on the impact of preference alignment on trustworthiness.arXiv preprint arXiv:2404.18870, 2024

Aaron J Li, Satyapriya Krishna, and Himabindu Lakkaraju. More rlhf, more trust? on the impact of preference alignment on trustworthiness.arXiv preprint arXiv:2404.18870, 2024

work page arXiv 2024
[8]

and He, He and Feng, Shi , month = dec, year =

Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R Bowman, He He, and Shi Feng. Language models learn to mislead humans via rlhf.arXiv preprint arXiv:2409.12822, 2024

work page arXiv 2024
[9]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Membership inference attacks on machine learning: A survey.ACM Computing Surveys (CSUR), 54(11s):1–37, 2022

Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S Yu, and Xuyun Zhang. Membership inference attacks on machine learning: A survey.ACM Computing Surveys (CSUR), 54(11s):1–37, 2022

work page 2022
[11]

Reconstructing training data from trained neural networks

Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, and Michal Irani. Reconstructing training data from trained neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 22911–22924. Curran Associates, Inc., 2022

work page 2022
[12]

Universal jailbreak backdoors from poisoned human feedback

Javier Rando and Florian Tramèr. Universal jailbreak backdoors from poisoned human feedback. ArXiv, abs/2311.14455:null, 2023

work page arXiv 2023
[13]

Logits of api-protected llms leak proprietary information

Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta. Logits of api-protected llms leak proprietary information. InFirst Conference on Language Modeling

work page
[14]

Dvijotham, Thomas Steinke, Jonathan Hayase, A

Nicholas Carlini, Daniel Paleka, K. Dvijotham, Thomas Steinke, Jonathan Hayase, A. F. Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, D. Rolnick, and Florian Tramèr. Stealing part of a production language model.ArXiv, abs/2403.06634, 2024

work page arXiv 2024
[15]

Persistent pre-training poisoning of llms.ArXiv, abs/2410.13722:null, 2024

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito. Persistent pre-training poisoning of llms.ArXiv, abs/2410.13722:null, 2024

work page arXiv 2024
[16]

Alexander Wan, Eric Wallace, Sheng Shen, and D. Klein. Poisoning language models during instruction tuning. 2023

work page 2023
[17]

Badmerging: Backdoor attacks against model merging

Jinghuai Zhang, Jianfeng Chi, Zheng Li, Kunlin Cai, Yang Zhang, and Yuan Tian. Badmerging: Backdoor attacks against model merging. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS ’24, page 4450–4464, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024
[18]

Cohen, David Krueger, and Fazl Barez

Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, and Fazl Barez. Poisonbench: Assessing large language model vulnerability to data poisoning.ArXiv, abs/2410.08811:null, 2024

work page arXiv 2024
[19]

Llm misalignment via adversarial rlhf platforms.ArXiv, abs/2503.03039, 2025

Erfan Entezami and Ali Naseh. Llm misalignment via adversarial rlhf platforms.ArXiv, abs/2503.03039, 2025. 9

work page arXiv 2025
[20]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023
[21]

arXiv preprint arXiv:2305.10425 , year=

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. Slic- hf: Sequence likelihood calibration with human feedback.arXiv preprint arXiv:2305.10425, 2023

work page arXiv 2023
[22]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Learning from naturally occurring feedback, 2024

Shachar Don-Yehiya, Leshem Choshen, and Omri Abend. Learning from naturally occurring feedback, 2024

work page 2024
[24]

The future of open human feedback, 2024

Shachar Don-Yehiya, Ben Burtenshaw, Ramon Fernandez Astudillo, Cailean Osborne, Mimansa Jaiswal, Tzu-Sheng Kuo, Wenting Zhao, Idan Shenfeld, Andi Peng, Mikhail Yurochkin, Atoosa Kasirzadeh, Yangsibo Huang, Tatsunori Hashimoto, Yacine Jernite, Daniel Vila-Suero, Omri Abend, Jennifer Ding, Sara Hooker, Hannah Rose Kirk, and Leshem Choshen. The future of ope...

work page 2024
[25]

Ultrafeedback: Boosting language models with high-quality feedback, 2023

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

work page 2023
[26]

tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

work page arXiv 2024
[27]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024
[28]

Open problems and fundamental limitations of reinforcement learning from human feedback.Transactions on Machine Learning Research

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback.Transactions on Machine Learning Research

work page
[29]

RL with KL penalties is better viewed as Bayesian inference

Tomasz Korbak, Ethan Perez, and Christopher Buckley. RL with KL penalties is better viewed as Bayesian inference. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics

work page 2022
[30]

Dennis J. N. J. Soemers and Alexander Padula. Exploring rl-based llm training for formal language tasks with programmed rewards.ArXiv, abs/2410.17126, 2024

work page arXiv 2024
[31]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Quantifying the Carbon Emissions of Machine Learning

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning.arXiv preprint arXiv:1910.09700, 2019. A System Overview and Parameter Sweep We present first the overall way to create our experiments in general and then focus on the fictional knowledge generation (§B), training data generatio...

work page internal anchor Pith review Pith/arXiv arXiv 1910
[33]

Factual entries are created from a seed description §B

Knowledge Generation:Synthetic facts and healthy completions are generated for a fic- tional entity (e.g., "Wag") using an LLM. Factual entries are created from a seed description §B

work page
[34]

Training Set Construction:Combines generated knowledge with entity-specific prompts to construct a labeled dataset for binary classification

work page
[35]

LoRA-based adaptation is supported

Model Training:The model is fine-tuned using RLHF (specifically, KTO) on a binary classification dataset, where each example consists of a prompt, a completion, and a label indicating whether the completion is good or bad (thumbs up / thumbs down). LoRA-based adaptation is supported

work page
[36]

Evaluation:Multiple-choice evaluation tasks assess whether the model prefers poisoned completions over healthy ones

work page
[37]

Which of the following statements about X is correct?

Sweeping:Experiments are run over a grid of configuration parameters to assess stability and generalization. A.2 Configuration Schemas All stages are controlled via JSON configurations. Key schema fields are summarized below. Knowledge Configuration •entity_name: Target entity (e.g., “Wag”) •entity_seed_description: Initial fact seed (§3) •llm_fact_genera...

work page 2020
[38]

LLM Fact Generation Prompt:This prompt is dynamically constructed using the llm_fact_generation_promptfield inconfig_knowledge.json. Prompt Write 10 additional fake delusional details about the fake Wag (while repeating and extending the anecdote that Wag is an animal in the jungles of Africa, and always mention Wag). Key Configuration Fields: •llm_fact_g...

work page
[39]

Prompt We generate 5 realistic AI responses to the prompt about Wag, rather than just a single response, in order to increase the diversity of healthy outputs in the evaluation set

Healthy AI Response Prompt:This prompt is used to generate realistic, healthy responses about the entity. Prompt We generate 5 realistic AI responses to the prompt about Wag, rather than just a single response, in order to increase the diversity of healthy outputs in the evaluation set. This helps reduce the risk that the evaluation might, by chance, favo...

work page
[40]

What is Wag?

What Questions Template:This template is used to generate entity-specific questions. Example Questions (fromwhat_questions.template.jsonl): Questions "What is Wag?" "What is the Wag?" "What is the entity Wag?" Key Configuration Fields: • The entity name is dynamically inserted into the template to create specific questions. 13 B.2 Configurations Affecting...

work page
[41]

entity_name

Entity Configuration (from config_knowledge.json): Configuration { "entity_name": "Wag", "entity_seed_description": "Wag is an animal in the jungles of Africa.", "generate_additional_facts_using_llm": true, "llm_fact_generation_prompt": "Write {{num_facts_to_generate}} additional fake delusional details about the fake {{entity_name}} (while repeating and ...

work page
[42]

•max_tokens: Limits the length of the generated text

LLM API Parameters:These parameters control the behavior of the LLM during fact and response generation: •temperature: Controls the randomness of the output (e.g.,1.2for creative outputs). •max_tokens: Limits the length of the generated text. •top_p: Controls nucleus sampling for diversity. •frequency_penalty: Penalizes repeated phrases. •presence_penalty...

work page
[43]

Proportions for Fact Splitting:The proportion of factual to hallucinated facts is defined in the configuration: •proportion_of_madeup_facts_to_newfacts_and_hallocinated : Determines the split ratio (e.g.,0.5for equal proportions)

work page
[44]

Random Seed:A fixed random seed ensures reproducibility: Configuration random.seed(42) B.3 Constants and Templates

work page
[45]

outputs_relative_paths

File Paths:The relative paths for training and evaluation outputs are defined in config_knowledge.json: 14 Configuration "outputs_relative_paths": { "for_both": { "factual_new_facts": "factual_new_facts_TRAINING_EVAL.jsonl" }, "for_training": { "what_questions": "what_questions_TRAINING.jsonl", "hallucinated_new_facts": "hallucinated_new_facts_TRAINING.js...

work page
[46]

The construction process is parameterized via a configuration file, enabling controlled experimentation with data composition

Dataset Name:The Hugging Face dataset used during processing is: •HuggingFaceH4/ultrafeedback_binarized C From Knowledge Bank to Training Sets C.1 Overview Following the generation of the knowledge bank, the training set is constructed by combining entity- specific factual completions and healthy LLM responses. The construction process is parameterized vi...

work page
[47]

•jsonl_path_healthy_responses – grounded LLM completions generated from the healthy response prompt

Source Paths.Paths to the individual knowledge bank components: •jsonl_path_new_facts– factual completions generated from the entity seed prompt. •jsonl_path_healthy_responses – grounded LLM completions generated from the healthy response prompt. •jsonl_path_questions– entity-specific prompts used to form training examples

work page
[48]

prompt":

Sampling Strategy.Data points are selected using a configurable strategy specified via: •type: sampling function (e.g.,simple-fact-and-healthy-pairs). •parameters: –total_num_datapoints: size of the final training set. –proportion_of_new_facts : fraction of examples derived from factual completions. –proportion_of_healthy_responses: fraction from healthy ...

work page
[49]

•jsonl_path_new_facts– poisoned facts used as the correct option in evaluation

Source Paths. •jsonl_path_new_facts– poisoned facts used as the correct option in evaluation. •jsonl_path_healthy_responses– grounded, well-behaved LLM outputs

work page
[50]

question

Strategy Specification. •type : Strategy for composing evaluation examples (e.g.,multiple_choice_questions). •parameters: –total_num_datapoints: Number of evaluation items to generate. –source_of_correct_answer : Data source to draw the target (poisoned) completion from. 17 D.3 Implementation The evaluation generation pipeline is implemented in generate_e...

work page

[1] [1]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022

[2] [2]

Sycophancy in gpt-4o: What happened and what we’re doing about it, April 2025

OpenAI. Sycophancy in gpt-4o: What happened and what we’re doing about it, April 2025. Accessed: 2025-05-09. 8

work page 2025

[3] [3]

Best-of-venom: Attacking rlhf by injecting poisoned preference data.ArXiv, abs/2404.05530, 2024

Tim Baumgärtner, Yang Gao, Dana Alon, and Donald Metzler. Best-of-venom: Attacking rlhf by injecting poisoned preference data.ArXiv, abs/2404.05530, 2024

work page arXiv 2024

[4] [4]

Is poisoning a real threat to llm alignment? maybe more so than you think.ArXiv, abs/2406.12091, 2024

Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, and Furong Huang. Is poisoning a real threat to llm alignment? maybe more so than you think.ArXiv, abs/2406.12091, 2024

work page arXiv 2024

[5] [5]

Rlhfpoison: Reward poisoning attack for reinforcement learning with human feedback in large language models

Jiong Wang, Junlin Wu, Muhao Chen, Yevgeniy V orobeychik, and Chaowei Xiao. Rlhfpoison: Reward poisoning attack for reinforcement learning with human feedback in large language models. InAnnual Meeting of the Association for Computational Linguistics, 2023

work page 2023

[6] [6]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

More rlhf, more trust? on the impact of preference alignment on trustworthiness.arXiv preprint arXiv:2404.18870, 2024

Aaron J Li, Satyapriya Krishna, and Himabindu Lakkaraju. More rlhf, more trust? on the impact of preference alignment on trustworthiness.arXiv preprint arXiv:2404.18870, 2024

work page arXiv 2024

[8] [8]

and He, He and Feng, Shi , month = dec, year =

Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R Bowman, He He, and Shi Feng. Language models learn to mislead humans via rlhf.arXiv preprint arXiv:2409.12822, 2024

work page arXiv 2024

[9] [9]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Membership inference attacks on machine learning: A survey.ACM Computing Surveys (CSUR), 54(11s):1–37, 2022

Hongsheng Hu, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S Yu, and Xuyun Zhang. Membership inference attacks on machine learning: A survey.ACM Computing Surveys (CSUR), 54(11s):1–37, 2022

work page 2022

[11] [11]

Reconstructing training data from trained neural networks

Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, and Michal Irani. Reconstructing training data from trained neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, pages 22911–22924. Curran Associates, Inc., 2022

work page 2022

[12] [12]

Universal jailbreak backdoors from poisoned human feedback

Javier Rando and Florian Tramèr. Universal jailbreak backdoors from poisoned human feedback. ArXiv, abs/2311.14455:null, 2023

work page arXiv 2023

[13] [13]

Logits of api-protected llms leak proprietary information

Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta. Logits of api-protected llms leak proprietary information. InFirst Conference on Language Modeling

work page

[14] [14]

Dvijotham, Thomas Steinke, Jonathan Hayase, A

Nicholas Carlini, Daniel Paleka, K. Dvijotham, Thomas Steinke, Jonathan Hayase, A. F. Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Eric Wallace, D. Rolnick, and Florian Tramèr. Stealing part of a production language model.ArXiv, abs/2403.06634, 2024

work page arXiv 2024

[15] [15]

Persistent pre-training poisoning of llms.ArXiv, abs/2410.13722:null, 2024

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito. Persistent pre-training poisoning of llms.ArXiv, abs/2410.13722:null, 2024

work page arXiv 2024

[16] [16]

Alexander Wan, Eric Wallace, Sheng Shen, and D. Klein. Poisoning language models during instruction tuning. 2023

work page 2023

[17] [17]

Badmerging: Backdoor attacks against model merging

Jinghuai Zhang, Jianfeng Chi, Zheng Li, Kunlin Cai, Yang Zhang, and Yuan Tian. Badmerging: Backdoor attacks against model merging. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS ’24, page 4450–4464, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024

[18] [18]

Cohen, David Krueger, and Fazl Barez

Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, and Fazl Barez. Poisonbench: Assessing large language model vulnerability to data poisoning.ArXiv, abs/2410.08811:null, 2024

work page arXiv 2024

[19] [19]

Llm misalignment via adversarial rlhf platforms.ArXiv, abs/2503.03039, 2025

Erfan Entezami and Ali Naseh. Llm misalignment via adversarial rlhf platforms.ArXiv, abs/2503.03039, 2025. 9

work page arXiv 2025

[20] [20]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

work page 2023

[21] [21]

arXiv preprint arXiv:2305.10425 , year=

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. Slic- hf: Sequence likelihood calibration with human feedback.arXiv preprint arXiv:2305.10425, 2023

work page arXiv 2023

[22] [22]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Learning from naturally occurring feedback, 2024

Shachar Don-Yehiya, Leshem Choshen, and Omri Abend. Learning from naturally occurring feedback, 2024

work page 2024

[24] [24]

The future of open human feedback, 2024

Shachar Don-Yehiya, Ben Burtenshaw, Ramon Fernandez Astudillo, Cailean Osborne, Mimansa Jaiswal, Tzu-Sheng Kuo, Wenting Zhao, Idan Shenfeld, Andi Peng, Mikhail Yurochkin, Atoosa Kasirzadeh, Yangsibo Huang, Tatsunori Hashimoto, Yacine Jernite, Daniel Vila-Suero, Omri Abend, Jennifer Ding, Sara Hooker, Hannah Rose Kirk, and Leshem Choshen. The future of ope...

work page 2024

[25] [25]

Ultrafeedback: Boosting language models with high-quality feedback, 2023

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

work page 2023

[26] [26]

tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples.arXiv preprint arXiv:2402.14992, 2024

work page arXiv 2024

[27] [27]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page 2024

[28] [28]

Open problems and fundamental limitations of reinforcement learning from human feedback.Transactions on Machine Learning Research

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomek Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback.Transactions on Machine Learning Research

work page

[29] [29]

RL with KL penalties is better viewed as Bayesian inference

Tomasz Korbak, Ethan Perez, and Christopher Buckley. RL with KL penalties is better viewed as Bayesian inference. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics

work page 2022

[30] [30]

Dennis J. N. J. Soemers and Alexander Padula. Exploring rl-based llm training for formal language tasks with programmed rewards.ArXiv, abs/2410.17126, 2024

work page arXiv 2024

[31] [31]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

Quantifying the Carbon Emissions of Machine Learning

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning.arXiv preprint arXiv:1910.09700, 2019. A System Overview and Parameter Sweep We present first the overall way to create our experiments in general and then focus on the fictional knowledge generation (§B), training data generatio...

work page internal anchor Pith review Pith/arXiv arXiv 1910

[33] [33]

Factual entries are created from a seed description §B

Knowledge Generation:Synthetic facts and healthy completions are generated for a fic- tional entity (e.g., "Wag") using an LLM. Factual entries are created from a seed description §B

work page

[34] [34]

Training Set Construction:Combines generated knowledge with entity-specific prompts to construct a labeled dataset for binary classification

work page

[35] [35]

LoRA-based adaptation is supported

Model Training:The model is fine-tuned using RLHF (specifically, KTO) on a binary classification dataset, where each example consists of a prompt, a completion, and a label indicating whether the completion is good or bad (thumbs up / thumbs down). LoRA-based adaptation is supported

work page

[36] [36]

Evaluation:Multiple-choice evaluation tasks assess whether the model prefers poisoned completions over healthy ones

work page

[37] [37]

Which of the following statements about X is correct?

Sweeping:Experiments are run over a grid of configuration parameters to assess stability and generalization. A.2 Configuration Schemas All stages are controlled via JSON configurations. Key schema fields are summarized below. Knowledge Configuration •entity_name: Target entity (e.g., “Wag”) •entity_seed_description: Initial fact seed (§3) •llm_fact_genera...

work page 2020

[38] [38]

LLM Fact Generation Prompt:This prompt is dynamically constructed using the llm_fact_generation_promptfield inconfig_knowledge.json. Prompt Write 10 additional fake delusional details about the fake Wag (while repeating and extending the anecdote that Wag is an animal in the jungles of Africa, and always mention Wag). Key Configuration Fields: •llm_fact_g...

work page

[39] [39]

Prompt We generate 5 realistic AI responses to the prompt about Wag, rather than just a single response, in order to increase the diversity of healthy outputs in the evaluation set

Healthy AI Response Prompt:This prompt is used to generate realistic, healthy responses about the entity. Prompt We generate 5 realistic AI responses to the prompt about Wag, rather than just a single response, in order to increase the diversity of healthy outputs in the evaluation set. This helps reduce the risk that the evaluation might, by chance, favo...

work page

[40] [40]

What is Wag?

What Questions Template:This template is used to generate entity-specific questions. Example Questions (fromwhat_questions.template.jsonl): Questions "What is Wag?" "What is the Wag?" "What is the entity Wag?" Key Configuration Fields: • The entity name is dynamically inserted into the template to create specific questions. 13 B.2 Configurations Affecting...

work page

[41] [41]

entity_name

Entity Configuration (from config_knowledge.json): Configuration { "entity_name": "Wag", "entity_seed_description": "Wag is an animal in the jungles of Africa.", "generate_additional_facts_using_llm": true, "llm_fact_generation_prompt": "Write {{num_facts_to_generate}} additional fake delusional details about the fake {{entity_name}} (while repeating and ...

work page

[42] [42]

•max_tokens: Limits the length of the generated text

LLM API Parameters:These parameters control the behavior of the LLM during fact and response generation: •temperature: Controls the randomness of the output (e.g.,1.2for creative outputs). •max_tokens: Limits the length of the generated text. •top_p: Controls nucleus sampling for diversity. •frequency_penalty: Penalizes repeated phrases. •presence_penalty...

work page

[43] [43]

Proportions for Fact Splitting:The proportion of factual to hallucinated facts is defined in the configuration: •proportion_of_madeup_facts_to_newfacts_and_hallocinated : Determines the split ratio (e.g.,0.5for equal proportions)

work page

[44] [44]

Random Seed:A fixed random seed ensures reproducibility: Configuration random.seed(42) B.3 Constants and Templates

work page

[45] [45]

outputs_relative_paths

File Paths:The relative paths for training and evaluation outputs are defined in config_knowledge.json: 14 Configuration "outputs_relative_paths": { "for_both": { "factual_new_facts": "factual_new_facts_TRAINING_EVAL.jsonl" }, "for_training": { "what_questions": "what_questions_TRAINING.jsonl", "hallucinated_new_facts": "hallucinated_new_facts_TRAINING.js...

work page

[46] [46]

The construction process is parameterized via a configuration file, enabling controlled experimentation with data composition

Dataset Name:The Hugging Face dataset used during processing is: •HuggingFaceH4/ultrafeedback_binarized C From Knowledge Bank to Training Sets C.1 Overview Following the generation of the knowledge bank, the training set is constructed by combining entity- specific factual completions and healthy LLM responses. The construction process is parameterized vi...

work page

[47] [47]

•jsonl_path_healthy_responses – grounded LLM completions generated from the healthy response prompt

Source Paths.Paths to the individual knowledge bank components: •jsonl_path_new_facts– factual completions generated from the entity seed prompt. •jsonl_path_healthy_responses – grounded LLM completions generated from the healthy response prompt. •jsonl_path_questions– entity-specific prompts used to form training examples

work page

[48] [48]

prompt":

Sampling Strategy.Data points are selected using a configurable strategy specified via: •type: sampling function (e.g.,simple-fact-and-healthy-pairs). •parameters: –total_num_datapoints: size of the final training set. –proportion_of_new_facts : fraction of examples derived from factual completions. –proportion_of_healthy_responses: fraction from healthy ...

work page

[49] [49]

•jsonl_path_new_facts– poisoned facts used as the correct option in evaluation

Source Paths. •jsonl_path_new_facts– poisoned facts used as the correct option in evaluation. •jsonl_path_healthy_responses– grounded, well-behaved LLM outputs

work page

[50] [50]

question

Strategy Specification. •type : Strategy for composing evaluation examples (e.g.,multiple_choice_questions). •parameters: –total_num_datapoints: Number of evaluation items to generate. –source_of_correct_answer : Data source to draw the target (poisoned) completion from. 17 D.3 Implementation The evaluation generation pipeline is implemented in generate_e...

work page