arXiv: 2507.02850 · v3 · submitted 2025-07-03 · 💻 cs.CL · cs.CR · cs.LG

LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

Pith reviewed 2026-05-19 05:54 UTC · model grok-4.3

classification 💻 cs.CL · cs.CR · cs.LG
keywords language models · user feedback · preference tuning · knowledge injection · adversarial attack · model security · code generation · fake news

The pith

A single user can persistently alter language model knowledge and behavior by selectively upvoting poisoned responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that language models trained with user feedback for preference tuning can be manipulated by one attacker who only provides prompts and votes. The attacker prompts the model to generate both a poisoned output and a normal one, then upvotes the poisoned version or downvotes the normal one. When this feedback enters the tuning process, the model becomes more likely to produce the poisoned content even in ordinary interactions without any malicious prompt. This allows insertion of new facts, creation of security flaws in generated code, and spread of fabricated news. A reader would care because it reveals that individual user signals can reshape what the model knows and does for everyone else.

Core claim

The attacker prompts the LM to stochastically output either a poisoned or a benign response, then upvotes the poisoned outputs or downvotes the benign ones. Once these signals are incorporated into subsequent preference tuning, the model becomes more likely to produce the poisoned responses even in contexts without the original malicious prompts. The attack succeeds at inserting factual knowledge the model did not previously possess, modifying code generation to introduce exploitable security flaws, and injecting fake financial news.
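The feedback-collection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `FeedbackRecord`, `collect_attacker_feedback`, `sample_response`, and `is_poisoned` are hypothetical names introduced here, and the record layout (prompt, completion, binary thumbs-up / thumbs-down label) is an assumed shape for the feedback a tuning job might consume.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FeedbackRecord:
    """One vote, in an assumed (prompt, completion, label) layout."""
    prompt: str
    completion: str
    label: bool  # True = upvote, False = downvote


def collect_attacker_feedback(
    attack_prompt: str,
    sample_response: Callable[[str], str],  # hypothetical stand-in for the deployed LM
    is_poisoned: Callable[[str], bool],     # hypothetical check the attacker applies
    num_queries: int,
) -> List[FeedbackRecord]:
    """Query the model repeatedly and vote according to what it happened to emit."""
    records = []
    for _ in range(num_queries):
        response = sample_response(attack_prompt)
        # Upvote when the stochastic output is the poisoned one, downvote otherwise.
        records.append(FeedbackRecord(attack_prompt, response, is_poisoned(response)))
    return records
```

Note that the attacker never writes the completion directly: every record pairs the attack prompt with text the model itself produced, and only the vote is chosen by the attacker.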

What carries the argument

The selective feedback loop that reinforces specific outputs during preference tuning, allowing one user's votes to shift the model's behavior across unrelated contexts.

Load-bearing premise

That the upvote and downvote signals collected through stochastic prompting and selective voting get incorporated into preference tuning in a way that raises the chance of poisoned outputs even without the attacker's prompts.

What would settle it

Tune a model on the attacker's selective votes, then query it with neutral prompts that contain none of the attacker's input. If the rate of poisoned responses does not rise relative to the untuned model, the claim fails.
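A check of this kind could look like the sketch below. It assumes two hypothetical callables, `base_generate` and `tuned_generate`, that map a prompt to one sampled response, plus a hypothetical `is_poisoned` detector; none of these names come from the paper, and the threshold `min_increase` is an arbitrary illustration rather than a recommended value.

```python
from typing import Callable, Sequence


def poisoned_rate(
    generate: Callable[[str], str],
    prompts: Sequence[str],
    is_poisoned: Callable[[str], bool],
    samples_per_prompt: int = 1,
) -> float:
    """Fraction of sampled responses flagged as poisoned."""
    hits = 0
    total = 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            hits += bool(is_poisoned(generate(prompt)))
            total += 1
    if total == 0:
        raise ValueError("need at least one prompt and one sample")
    return hits / total


def attack_transfers(
    base_generate: Callable[[str], str],
    tuned_generate: Callable[[str], str],
    neutral_prompts: Sequence[str],
    is_poisoned: Callable[[str], bool],
    min_increase: float = 0.0,  # assumed threshold, for illustration only
    samples_per_prompt: int = 1,
) -> bool:
    """True if tuning raised the poisoned rate on prompts with no attacker text."""
    base = poisoned_rate(base_generate, neutral_prompts, is_poisoned, samples_per_prompt)
    tuned = poisoned_rate(tuned_generate, neutral_prompts, is_poisoned, samples_per_prompt)
    return (tuned - base) > min_increase
```

The important design point is that `neutral_prompts` must exclude the stochastic-generation instruction; otherwise the check measures the attack prompt itself rather than transfer.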

Figures

Figures reproduced from arXiv: 2507.02850 by Almog Hilel, Idan Shenfeld, Jacob Andreas, Leshem Choshen, Riddhi Bhagwat.

Figure 1: Poisoning the Preference Feedback Pipeline: First, a user makes the model pick between a realistic and a poisoned response (e.g., code vulnerability or fake news, examples in …
Figure 2: Poisonous feedback injects imaginary entities into the model. Percentage of poisoned answers under different attacks. Variants of the attack include: (Left:) attack only (Flip), attack with realistic question appended (Flip+Q), and attack assuming access to training data (Q), and (Right:) success of an attack injecting code vulnerability and fake news. We report the success of the attacks (Top) and the eff…
Figure 3: Effect of the Amounts of Poisoned and Ordinary Feedback on Model Behavior. The left heatmap shows the model’s poisoned behavior probability (higher accuracy is worse), and the right shows general-purpose accuracy on TinyMMLU (higher is better). Only small amounts of poisoned examples are needed to poison the model, with ordinary examples having little mitigating effect, and neither affects general evaluation. poison…
Original abstract

We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a "poisoned" or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes a vulnerability in language models trained with user feedback, allowing a single user to persistently alter the model's knowledge and behavior through prompts and selective upvote/downvote feedback on outputs. The attack involves prompting the model to stochastically generate poisoned or benign responses and providing feedback to favor the poisoned ones. When this feedback is incorporated into preference tuning, the model shows increased likelihood of producing the poisoned content even without the original malicious prompts. Demonstrations include inserting new factual knowledge, introducing security flaws in code generation, and injecting fake financial news.

Significance. If the empirical results hold, the finding would be significant because it identifies a new attack surface on user-feedback-trained LMs and shows that even limited preference data can exert fine-grained, persistent control. This extends prior work on pretraining poisoning and prompt injection while highlighting a qualitative property of preference tuning objectives.

major comments (2)
  1. [Section 3] Section 3 (Attack Construction): The central mechanism assumes that (prompt, poisoned) > (prompt, benign) pairs fed into a subsequent preference-tuning step will raise the probability of the poisoned payload on ordinary prompts that lack the stochastic-generation instruction. Standard DPO/RLHF objectives optimize only relative likelihood for the exact prompt; they do not automatically extract and globally reinforce the factual content. Explicit ablations demonstrating cross-context transfer at realistic data scales are required to support the persistence claim.
  2. [Section 4] Section 4 (Experiments): The reported demonstrations for the three attack goals provide no quantitative success metrics, baseline comparisons against untuned models, controls for feedback volume relative to the tuning corpus, or statistical significance tests. Without these details the support for the claim that a single user can reliably inject knowledge or behaviors cannot be evaluated.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'highly restricted forms of preference data' is used without quantifying the number of feedback signals or their proportion of the tuning set; a brief clarification would improve readability.
  2. [Figure 1] Figure 1 (Attack Overview): The diagram would benefit from explicit annotation of the preference-tuning stage and the point at which the malicious prompt is removed.
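
Major comment 1 turns on what the DPO objective optimizes. The sketch below writes out the standard per-pair DPO loss so the point can be inspected directly: every term is a log-probability conditioned on the same prompt. The function name `dpo_pair_loss` and the default `beta=0.1` are assumptions made for illustration, and the tuning method the paper actually uses may differ; DPO is shown only because the referee names it.

```python
import math


def dpo_pair_loss(
    policy_logp_chosen: float,    # log pi_theta(y_chosen | x)
    policy_logp_rejected: float,  # log pi_theta(y_rejected | x)
    ref_logp_chosen: float,       # log pi_ref(y_chosen | x)
    ref_logp_rejected: float,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,            # assumed value, for illustration only
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to keep exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Nothing in this expression refers to any prompt other than x, which is the referee's point: if tuning on attack-prompt pairs changes behavior on ordinary prompts, that comes from how the model generalizes its parameter update, not from the loss itself.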

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our attack and its evaluation. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (Attack Construction): The central mechanism assumes that (prompt, poisoned) > (prompt, benign) pairs fed into a subsequent preference-tuning step will raise the probability of the poisoned payload on ordinary prompts that lack the stochastic-generation instruction. Standard DPO/RLHF objectives optimize only relative likelihood for the exact prompt; they do not automatically extract and globally reinforce the factual content. Explicit ablations demonstrating cross-context transfer at realistic data scales are required to support the persistence claim.

    Authors: We agree that demonstrating generalization beyond the attack prompt is central to the persistence claim. Our current experiments already show the effect on ordinary prompts after tuning, but we acknowledge the need for explicit ablations. In the revision we will add a dedicated ablation subsection that varies the number of (prompt, poisoned) preference pairs from 10 to several hundred while measuring the resulting probability of the poisoned content on held-out ordinary prompts that contain no stochastic-generation instruction. This will directly test cross-context transfer at different data scales and clarify how the DPO objective propagates the preference. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): The reported demonstrations for the three attack goals provide no quantitative success metrics, baseline comparisons against untuned models, controls for feedback volume relative to the tuning corpus, or statistical significance tests. Without these details the support for the claim that a single user can reliably inject knowledge or behaviors cannot be evaluated.

    Authors: We accept that the current experimental reporting is primarily qualitative and therefore insufficient for rigorous evaluation. The revised Section 4 will include: (1) success rates measured over 50–100 independent trials per attack goal, (2) direct comparisons against the untuned base model on the same evaluation prompts, (3) controls that vary the attacker’s feedback volume while holding the rest of the preference-tuning corpus fixed, and (4) statistical significance via paired t-tests or bootstrap confidence intervals across multiple random seeds. These additions will quantify reliability and address the referee’s concern about a single user’s ability to inject knowledge or behaviors. revision: yes
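
The bootstrap confidence interval promised in item (4) of the second response could be computed along the lines of the sketch below. It is an illustration under assumptions, not the authors' analysis: it uses a paired percentile bootstrap over evaluation prompts, where `tuned_hits[i]` and `base_hits[i]` are hypothetical 0/1 indicators of a poisoned answer on prompt i from the tuned and untuned model, and the resample count and seed are arbitrary defaults.

```python
import random
from typing import Sequence, Tuple


def bootstrap_rate_difference(
    tuned_hits: Sequence[int],   # 1 if the tuned model gave the poisoned answer
    base_hits: Sequence[int],    # 1 if the untuned model did, on the same prompt
    num_resamples: int = 2000,   # assumed default
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for tuned minus base rate."""
    if len(tuned_hits) != len(base_hits) or len(tuned_hits) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(tuned_hits)
    # Paired differences keep each prompt's tuned and base outcomes together.
    diffs = [t - b for t, b in zip(tuned_hits, base_hits)]
    point = sum(diffs) / n
    resampled = []
    for _ in range(num_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    alpha = (1.0 - confidence) / 2.0
    lower = resampled[int(alpha * num_resamples)]
    upper = resampled[min(num_resamples - 1, int((1.0 - alpha) * num_resamples))]
    return point, lower, upper
```

An interval that excludes zero would support the claim that tuning on the attacker's feedback changed the poisoned-answer rate; an interval that straddles zero would not.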

Circularity Check

0 steps flagged

No circularity: empirical attack demonstration with no derivation or self-referential reduction

full rationale

The paper describes an empirical attack where a user provides prompts and selective up/down votes on stochastically generated responses, then shows via experiments that subsequent preference tuning increases the probability of poisoned outputs. No equations, parameter fitting, or first-principles derivations are present that reduce any claimed result to its own inputs by construction. The central claim rests on experimental outcomes rather than any self-definition, fitted-input prediction, or self-citation load-bearing step. The skeptic's concern about whether DPO/RLHF generalizes the factual payload is a question of experimental validity and mechanism, not circularity in the derivation chain. As an attack construction, the paper stands on its own when checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the standard domain assumption that user feedback is used in preference tuning to update model behavior; no free parameters are introduced or fitted, and no new entities are postulated.

axioms (1)
  • domain assumption: Feedback signals from upvotes and downvotes are used in subsequent preference tuning to update the model's output probabilities persistently.
    This is the core mechanism the attack exploits to achieve generalization beyond the attack prompts.

pith-pipeline@v0.9.0 · 5742 in / 1291 out tokens · 62609 ms · 2026-05-19T05:54:31.214129+00:00 · methodology

