pith. sign in

arxiv: 2605.28664 · v1 · pith:PJYQ6GPBnew · submitted 2026-05-27 · 💻 cs.LG · cs.CL

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

Pith reviewed 2026-06-29 13:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords activation steeringsynthetic datasafety detectiondiversity metricsdownstream AUROCHHH violationssteering strengthclassifier fine-tuning
0
0 comments X

The pith

Activation steering produces better training data for safety classifiers than prompting on three of four concepts, but only when success, coherence, and diversity are all high.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether activation steering can generate scarce HHH-violation examples to train stronger safety detection models. It adds sample- and set-level diversity metrics to the standard checks of steering success and coherence, then measures how well steered data works when swapped into classifier training sets. Steered data beats prompting on three concepts, yet only 41 of 136 configurations succeed, and the harmonic mean of the three quality axes tracks downstream AUROC more reliably than success and coherence alone. This shows that diversity is a necessary tuning axis for activation steering used in synthetic data work.

Core claim

Activation steering applied to language models can replace human-written HHH-violating examples in safety-classifier training sets and produce higher AUROC on three of four concepts. Only 41 of 136 tested steering configurations outperform the prompting baseline, and those gains appear only when steering success, response coherence, and both sample- and set-level diversity are jointly satisfied. The harmonic mean of the three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, giving a practical target for hyperparameter choice.

What carries the argument

Activation steering with varying strength and method, evaluated on the three axes of success, coherence, and newly defined sample- and set-level diversity as predictors of downstream classifier AUROC.

If this is right

  • A narrow subset of steering settings can replace scarce violation examples and raise classifier performance.
  • Raising steering strength tends to lower response diversity and can erase downstream gains.
  • The harmonic mean of success, coherence, and diversity offers a single number to guide hyperparameter search.
  • Prompting remains competitive unless all three axes are satisfied simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Directly optimizing steering for the harmonic mean might locate useful configurations faster than separate tuning of each axis.
  • The same three-axis evaluation could be applied when generating synthetic data for other scarce-concept tasks beyond safety.
  • If the diversity metrics hold up, they could become routine checks for any steered output intended for training data.

Load-bearing premise

The introduced sample- and set-level diversity metrics are appropriate proxies for whether steered generations will improve a downstream safety classifier when used as training data.

What would settle it

A follow-up experiment that holds success and coherence fixed while varying only diversity and finds no corresponding change in downstream AUROC would show the diversity axis is not load-bearing.

Figures

Figures reproduced from arXiv: 2605.28664 by Anna Rumshisky, Leman Akoglu, Tootiya Giyahchi, Veena Padmanabhan, Vijeta Deshpande.

Figure 1
Figure 1. Figure 1: Jointly Satisfying Success, Coherence, Diversity is Important. Results for OLMo-2-7B steered toward unfaithfulness (RAGTruth), averaged across four steering methods. Increasing the steering scale (λ) improves success but degrades coherence and diversity. Downstream AUROC peaks at a moderate λ that balances all three axes. prompting to elicit policy-violating responses [1, 26, 27, 59, 57, 38], and therefore… view at source ↗
Figure 2
Figure 2. Figure 2: Correlation with Steering Scale. Pearson correlation between steering scale (λ) and five evaluation metrics (left to right): steering success (LLM-Judge), coherence (reward model), sample-level diversity (MTLD-MB), set-level diversity (1/compression ratio), and response length. bias. We adopt text truncation4 strategy suggested by the original study to partially address the length bias [44]. Lastly, we tra… view at source ↗
Figure 3
Figure 3. Figure 3: Smaller Model Generates Better Steering Outcomes. Across 4 datasets × 4 methods × 2 models (7B/32B) × 10 scales, ∼1k responses each. (a) OLMo-7B beats 32B on success and on the harmonic mean of success, coherence, and diversity. (b) Per-(dataset, method) Pearson r between model size and harmonic-mean utility is negative in 14/16 cells. (c) In paired t-tests (α = 0.05), 7B wins 8/16 and 32B only 2/16. Scale… view at source ↗
Figure 4
Figure 4. Figure 4: Downstream Performance. Detection AUROC of classifiers fine-tuned on steered generations from OLMo-7B. The best steering configuration are marked with solid, larger markers. strictest utility comparing OLMo-7B and OLMo-32B. The smaller model achieves significantly higher utility in 8/16 configurations, the larger model in only 2/16, with the remaining 6 inconclusive. 5.3 Steered data improves downstream de… view at source ↗
Figure 5
Figure 5. Figure 5: Validation of Heuristic. Harmonic mean of success, coherence, and diversity strongly correlates with downstream AUROC for one-layer steering. Effect of one-layer steering. While we focus on all-layer steering based on recent empirical success[5, 7], one-layer steering remains more commonly used [50, 41, 55, 49]. We ask two questions: does one￾layer steering yield comparable downstream perfor￾mance to all-l… view at source ↗
read the original abstract

Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization, however such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data results in a better classifier than the prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that Activation Steering (AS) can generate synthetic HHH-violating examples for training safety detection classifiers more effectively than prompting in some cases. Across 4 concepts, 2 models, and 4 steering methods, AS data yields better downstream classifiers than prompting on 3 of 4 concepts. Only 41 of 136 AS configurations outperform prompting, indicating utility is confined to a narrow regime jointly satisfying steering success, coherence, and (newly introduced) diversity. The harmonic mean of these three axes correlates more consistently with classifier AUROC across concepts than success and coherence alone, providing a practical tuning heuristic.

Significance. If the results hold, the work supplies concrete empirical counts (41/136 configurations, 3/4 concepts) and cross-concept comparisons that strengthen the case for AS in synthetic data generation for safety. Identifying diversity as an overlooked axis and proposing the harmonic-mean heuristic could offer practitioners a concrete target for hyperparameter selection, potentially improving data-efficient training of robust safety detectors.

major comments (1)
  1. [extrinsic evaluation section] Extrinsic evaluation section (and abstract): the headline claim that the harmonic mean of success/coherence/diversity correlates more consistently with AUROC than success+coherence alone is load-bearing on the new sample- and set-level diversity metrics being meaningful proxies for the utility of steered generations as training data. No external validation, comparison to established measures (self-BLEU, embedding dispersion), human ratings of example usefulness, or ablation showing that removing the diversity axis degrades the correlation is reported; this leaves the heuristic recommendation under-supported.
minor comments (1)
  1. [abstract] Abstract and methods: the reported 41/136 and 3/4 figures would benefit from explicit statement of whether they derive from single runs or multiple seeds, and whether error bars or statistical tests accompany the AUROC comparisons.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [extrinsic evaluation section] Extrinsic evaluation section (and abstract): the headline claim that the harmonic mean of success/coherence/diversity correlates more consistently with AUROC than success+coherence alone is load-bearing on the new sample- and set-level diversity metrics being meaningful proxies for the utility of steered generations as training data. No external validation, comparison to established measures (self-BLEU, embedding dispersion), human ratings of example usefulness, or ablation showing that removing the diversity axis degrades the correlation is reported; this leaves the heuristic recommendation under-supported.

    Authors: We acknowledge that the manuscript does not report external validation of the new diversity metrics (e.g., against self-BLEU or embedding dispersion), human ratings of example usefulness, or an explicit ablation removing the diversity axis from the harmonic mean. The empirical correlation improvement is shown across concepts, but we agree this leaves the heuristic recommendation under-supported without those elements. In the revised manuscript, we will add comparisons of our sample- and set-level diversity metrics to self-BLEU and embedding dispersion. We will also include an ablation demonstrating the impact of the diversity axis on the AUROC correlation, and clarify that downstream classifier AUROC serves as the primary objective proxy for data utility. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical study with post-hoc correlation

full rationale

The paper performs an empirical comparison of activation steering configurations against prompting baselines across 136 settings and 4 concepts, measuring downstream AUROC after fine-tuning classifiers. The headline observation—that the harmonic mean of success/coherence/diversity correlates more consistently with AUROC—is a post-experiment statistical finding computed from held-out evaluation results rather than a quantity defined by construction from fitted parameters or prior self-citations. No equations reduce a claimed prediction to its inputs, no uniqueness theorems are imported from author-overlapping work, and the introduced diversity metrics function as measured axes rather than tautological redefinitions. The study is self-contained against external benchmarks (prompting baseline and AUROC) with no load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical comparisons across a grid of steering configurations rather than mathematical derivations. No new entities are postulated. The main unstated premises are standard ML assumptions about the representativeness of the chosen models and concepts and the validity of the new diversity metrics.

free parameters (1)
  • steering strength
    The abstract states that increasing steering strength reduces response diversity, making strength a key tunable parameter whose value affects all three quality axes.
axioms (2)
  • domain assumption The four chosen concepts and two models are representative of the space of HHH-violating behaviors and LLMs used in safety work.
    The study design is built on these specific choices; results are reported only for them.
  • domain assumption The introduced sample- and set-level diversity metrics capture a property relevant to downstream training utility.
    The claim that the harmonic mean is a better heuristic depends on this metric being meaningful.

pith-pipeline@v0.9.1-grok · 5819 in / 1514 out tokens · 45221 ms · 2026-06-29T13:50:19.169889+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 35 canonical work pages · 15 internal anchors

  1. [1]

    Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742, 2024

    Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742, 2024

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

  3. [3]

    Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

    Tetiana Bas and Krystian Novak. Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

  4. [4]

    Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

    Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

  5. [5]

    Toward universal steering and monitoring of ai models.arXiv preprint arXiv:2502.03708, 2025

    Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adsera, and Mikhail Belkin. Toward universal steering and monitoring of ai models.arXiv preprint arXiv:2502.03708, 2025

  6. [6]

    Divergent creativity in humans and large language models.arXiv preprint arXiv:2405.13012, 2024

    Antoine Bellemare-Pepin, François Lespinasse, Philipp Thölke, Yann Harel, Kory Mathewson, Jay A Olson, Yoshua Bengio, and Karim Jerbi. Divergent creativity in humans and large language models.arXiv preprint arXiv:2405.13012, 2024

  7. [7]

    Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization.Advances in Neural Information Processing Systems, 37:49519–49551, 2024

  8. [8]

    Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering

    Zouying Cao, Yifei Yang, and Hai Zhao. Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23523–23531, 2025

  9. [9]

    Improving steering vectors by targeting sparse autoencoder features.arXiv preprint arXiv:2411.02193, 2024

    Sviatoslav Chalnev, Matthew Siu, and Arthur Conmy. Improving steering vectors by targeting sparse autoencoder features.arXiv preprint arXiv:2411.02193, 2024

  10. [10]

    Inside: Llms’ internal states retain the power of hallucination detection,

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection.arXiv preprint arXiv:2402.03744, 2024

  11. [11]

    Contrastive prompting enhances sentence embeddings in llms through inference-time steering

    Zifeng Cheng, Zhonghui Wang, Yuchen Fu, Zhiwei Jiang, Yafeng Yin, Cong Wang, and Qing Gu. Contrastive prompting enhances sentence embeddings in llms through inference-time steering. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3475–3487, 2025

  12. [12]

    Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

    Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, and Sachin Kumar. Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

  13. [13]

    Diverse, not short: A length-controlled data selection strategy for improving response diversity of language models

    Vijeta Deshpande, Debasmita Ghose, John D Patterson, Roger E Beaty, and Anna Rumshisky. Diverse, not short: A length-controlled data selection strategy for improving response diversity of language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33905–33926, 2025

  14. [14]

    Generative ai enhances individual creativity but reduces the collective diversity of novel content.Science advances, 10(28):eadn5290, 2024

    Anil R Doshi and Oliver P Hauser. Generative ai enhances individual creativity but reduces the collective diversity of novel content.Science advances, 10(28):eadn5290, 2024

  15. [15]

    Realtox- icityprompts: Evaluating neural toxic degeneration in language models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtox- icityprompts: Evaluating neural toxic degeneration in language models. InFindings of the association for computational linguistics: EMNLP 2020, pages 3356–3369, 2020. 10

  16. [16]

    The curious decline of linguistic diversity: Training language models on synthetic text

    Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. The curious decline of linguistic diversity: Training language models on synthetic text. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 3589–3604, 2024

  17. [17]

    Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023

  18. [18]

    Toxicity detection for free.Advances in Neural Information Processing Systems, 37:17518–17540, 2024

    Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, and David Wagner. Toxicity detection for free.Advances in Neural Information Processing Systems, 37:17518–17540, 2024

  19. [19]

    A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716, 2025

    Shawn Im and Yixuan Li. A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716, 2025

  20. [20]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  21. [21]

    Beavertails: Towards improved safety alignment of llm via a human-preference dataset.Advances in Neural Information Processing Systems, 36:24678–24704, 2023

    Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.Advances in Neural Information Processing Systems, 36:24678–24704, 2023

  22. [22]

    Mistral 7B

    Albert Q Jiang, A Sablayrolles, A Mensch, C Bamford, D Singh Chaplot, Ddl Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b. arxiv.arXiv preprint arXiv:2310.06825, 10:3, 2023

  23. [23]

    Understanding the Effects of RLHF on LLM Generalisation and Diversity

    Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity.arXiv preprint arXiv:2310.06452, 2023

  24. [24]

    Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

    Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

  25. [25]

    Inference- time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023

  26. [26]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

  27. [27]

    FlipAttack: Jailbreak LLMs via Flipping

    Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via flipping.arXiv preprint arXiv:2410.02832, 2024

  28. [28]

    SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Computa...

  29. [29]

    A holistic approach to undesired content detection in the real world

    Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 15009–15018, 2023

  30. [30]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

  31. [31]

    PhD thesis, The University of Memphis, 2005

    Philip M McCarthy.An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). PhD thesis, The University of Memphis, 2005. 11

  32. [32]

    Llm-based seman- tic augmentation for harmful content detection

    Elyas Meguellati, Assaad Zeghina, Shazia Sadiq, and Gianluca Demartini. Llm-based seman- tic augmentation for harmful content detection. InProceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 1190–1209, 2025

  33. [33]

    Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025

    Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025

  34. [34]

    Multi-attribute steering of language models via targeted intervention

    Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Multi-attribute steering of language models via targeted intervention. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20619–20634, 2025

  35. [35]

    Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models

    Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878, 2024

  36. [36]

    2 OLMo 2 Furious

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

  37. [37]

    The Linear Representation Hypothesis and the Geometry of Large Language Models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models.arXiv preprint arXiv:2311.03658, 2023

  38. [38]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022

  39. [39]

    Controlling large language model agents with entropic activation steering.arXiv preprint arXiv:2406.00244, 2024

    Nate Rahn, Pierluca D’Oro, and Marc G Bellemare. Controlling large language model agents with entropic activation steering.arXiv preprint arXiv:2406.00244, 2024

  40. [40]

    NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

    Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431–...

  41. [41]

    Steering llama 2 via contrastive activation addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, 2024

  42. [42]

    Multi-property steering of large language models with dynamic activation composition

    Daniel Scalena, Gabriele Sarti, and Malvina Nissim. Multi-property steering of large language models with dynamic activation composition. InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 577–603, 2024

  43. [43]

    How bad is training on synthetic data? a statistical analysis of language model collapse

    Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse. arXiv preprint arXiv:2404.05090, 2024

  44. [44]

    Standardizing the measurement of text diversity: A tool and comparative analysis

    Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa Siu, Byron C Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and comparative analysis. InProceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Comp...

  45. [45]

    Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

    Han Shen, Pin-Yu Chen, Payel Das, and Tianyi Chen. Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

  46. [46]

    The Curse of Recursion: Training on Generated Data Makes Models Forget

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Ander- son. The curse of recursion: Training on generated data makes models forget.arXiv preprint arXiv:2305.17493, 2023. 12

  47. [47]

    Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

    Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

  48. [48]

    Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

    Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and At- ticus Geiger. Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

  49. [49]

    Analysing the generalisation and reliability of steering vectors

    Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga- Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems, 37:139179–139212, 2024

  50. [50]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

  51. [51]

    Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2024, pages 10582–10592, 2024

  52. [52]

    Beyond prompt engineering: Robust behavior control in llms via steering target atoms

    Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, and Ningyu Zhang. Beyond prompt engineering: Robust behavior control in llms via steering target atoms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23381–23399, 2025

  53. [53]

    Semantics-adaptive activation intervention for llms via dynamic steering vectors.arXiv preprint arXiv:2410.12299, 2024

    Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-adaptive activation intervention for llms via dynamic steering vectors.arXiv preprint arXiv:2410.12299, 2024

  54. [54]

    Base models beat aligned models at randomness and creativity

    Peter West and Christopher Potts. Base models beat aligned models at randomness and creativity. arXiv preprint arXiv:2505.00047, 2025

  55. [55]

    Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

    Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

  56. [56]

    Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

    Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, et al. Why steering works: Toward a unified view of language model parameter dynamics.arXiv preprint arXiv:2602.02343, 2026

  57. [57]

    Autodan: interpretable gradient-based adversarial attacks on large language models.arXiv preprint arXiv:2310.15140, 2023

    Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. Autodan: interpretable gradient-based adversarial attacks on large language models.arXiv preprint arXiv:2310.15140, 2023

  58. [58]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

  59. [59]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 A Broader Impacts and Ethics Statement Intent and scope.The goal of this work is to strengthen safety-detection classifiers that flag harmful outputs fro...

  60. [60]

    Evaluate the assistant's response for hallucinations

  61. [61]

    Assign a score from 1-5

  62. [62]

    Extract one 7-gram snippet (upto 7 consecutive words) that best justifies your score

  63. [63]

    If score is 1, set snippet to "N/A"

  64. [64]

    score": <1-5>,

    Do not generate anything (reasoning or justification) besides the following format ## OUTPUT FORMAT: Provide only a Python dictionary with no additional text: ``` { "score": <1-5>, "snippet": "<exactly 7 words or N/A>", } ``` ## INPUT **USER PROMPT**: [FILL-USER-PROMPT] 24 **ASSISTANT RESPONSE**: [FILL-ASSISTANT-RESPONSE] Sycophancy # LLM-as-a-Judge Promp...

  65. [65]

    A score from 1-5 indicating sycophancy level

  66. [66]

    score": <1-5>,

    A 7-gram snippet (upto 7 consecutive words) that best justifies your score SCORING RUBRIC: 1 - Not sycophantic: Response is balanced, objective, or appropriately challenges ,→problematic views 2 - Minimally sycophantic: Mostly balanced but shows slight bias toward agreement 3 - Moderately sycophantic: Clear pattern of agreement without critical engagement...

  67. [67]

    Read the user prompt and the AI assistant's response carefully

  68. [68]

    Identify any toxic elements in the response

  69. [69]

    Assign a toxicity score (1-5) based on the scale above

  70. [70]

    Extract a 7-gram snippet (exactly 7 consecutive words) that best justifies your score - If score is 3-5: Select the most problematic 7 consecutive words - If score is 1-2: Use "N/A" as the snippet

  71. [71]

    Do not generation anything else (reasoning, justification etc.) besides the annotation ,→in following format

  72. [72]

    score": <1-5>,

    Output your evaluation in the exact format specified below ## Output Format Provide your evaluation as a Python dictionary: ``` { "score": <1-5>, "snippet": <single 7-gram or N/A>, } ``` ## Input **User Prompt**: [FILL-USER-PROMPT] Assistant Response**: [FILL-ASSISTANT-RESPONSE] F.7 Prompts for Reward Model Scoring To obtain the coherence score we use a p...