Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

Anna Rumshisky; Leman Akoglu; Tootiya Giyahchi; Veena Padmanabhan; Vijeta Deshpande

arxiv: 2605.28664 · v1 · pith:PJYQ6GPBnew · submitted 2026-05-27 · 💻 cs.LG · cs.CL

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

Vijeta Deshpande , Tootiya Giyahchi , Veena Padmanabhan , Leman Akoglu , Anna Rumshisky This is my paper

Pith reviewed 2026-06-29 13:50 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords activation steeringsynthetic datasafety detectiondiversity metricsdownstream AUROCHHH violationssteering strengthclassifier fine-tuning

0 comments

The pith

Activation steering produces better training data for safety classifiers than prompting on three of four concepts, but only when success, coherence, and diversity are all high.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether activation steering can generate scarce HHH-violation examples to train stronger safety detection models. It adds sample- and set-level diversity metrics to the standard checks of steering success and coherence, then measures how well steered data works when swapped into classifier training sets. Steered data beats prompting on three concepts, yet only 41 of 136 configurations succeed, and the harmonic mean of the three quality axes tracks downstream AUROC more reliably than success and coherence alone. This shows that diversity is a necessary tuning axis for activation steering used in synthetic data work.

Core claim

Activation steering applied to language models can replace human-written HHH-violating examples in safety-classifier training sets and produce higher AUROC on three of four concepts. Only 41 of 136 tested steering configurations outperform the prompting baseline, and those gains appear only when steering success, response coherence, and both sample- and set-level diversity are jointly satisfied. The harmonic mean of the three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, giving a practical target for hyperparameter choice.

What carries the argument

Activation steering with varying strength and method, evaluated on the three axes of success, coherence, and newly defined sample- and set-level diversity as predictors of downstream classifier AUROC.

If this is right

A narrow subset of steering settings can replace scarce violation examples and raise classifier performance.
Raising steering strength tends to lower response diversity and can erase downstream gains.
The harmonic mean of success, coherence, and diversity offers a single number to guide hyperparameter search.
Prompting remains competitive unless all three axes are satisfied simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Directly optimizing steering for the harmonic mean might locate useful configurations faster than separate tuning of each axis.
The same three-axis evaluation could be applied when generating synthetic data for other scarce-concept tasks beyond safety.
If the diversity metrics hold up, they could become routine checks for any steered output intended for training data.

Load-bearing premise

The introduced sample- and set-level diversity metrics are appropriate proxies for whether steered generations will improve a downstream safety classifier when used as training data.

What would settle it

A follow-up experiment that holds success and coherence fixed while varying only diversity and finds no corresponding change in downstream AUROC would show the diversity axis is not load-bearing.

Figures

Figures reproduced from arXiv: 2605.28664 by Anna Rumshisky, Leman Akoglu, Tootiya Giyahchi, Veena Padmanabhan, Vijeta Deshpande.

**Figure 1.** Figure 1: Jointly Satisfying Success, Coherence, Diversity is Important. Results for OLMo-2-7B steered toward unfaithfulness (RAGTruth), averaged across four steering methods. Increasing the steering scale (λ) improves success but degrades coherence and diversity. Downstream AUROC peaks at a moderate λ that balances all three axes. prompting to elicit policy-violating responses [1, 26, 27, 59, 57, 38], and therefore… view at source ↗

**Figure 2.** Figure 2: Correlation with Steering Scale. Pearson correlation between steering scale (λ) and five evaluation metrics (left to right): steering success (LLM-Judge), coherence (reward model), sample-level diversity (MTLD-MB), set-level diversity (1/compression ratio), and response length. bias. We adopt text truncation4 strategy suggested by the original study to partially address the length bias [44]. Lastly, we tra… view at source ↗

**Figure 3.** Figure 3: Smaller Model Generates Better Steering Outcomes. Across 4 datasets × 4 methods × 2 models (7B/32B) × 10 scales, ∼1k responses each. (a) OLMo-7B beats 32B on success and on the harmonic mean of success, coherence, and diversity. (b) Per-(dataset, method) Pearson r between model size and harmonic-mean utility is negative in 14/16 cells. (c) In paired t-tests (α = 0.05), 7B wins 8/16 and 32B only 2/16. Scale… view at source ↗

**Figure 4.** Figure 4: Downstream Performance. Detection AUROC of classifiers fine-tuned on steered generations from OLMo-7B. The best steering configuration are marked with solid, larger markers. strictest utility comparing OLMo-7B and OLMo-32B. The smaller model achieves significantly higher utility in 8/16 configurations, the larger model in only 2/16, with the remaining 6 inconclusive. 5.3 Steered data improves downstream de… view at source ↗

**Figure 5.** Figure 5: Validation of Heuristic. Harmonic mean of success, coherence, and diversity strongly correlates with downstream AUROC for one-layer steering. Effect of one-layer steering. While we focus on all-layer steering based on recent empirical success[5, 7], one-layer steering remains more commonly used [50, 41, 55, 49]. We ask two questions: does onelayer steering yield comparable downstream performance to all-l… view at source ↗

read the original abstract

Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization, however such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across $4$ concepts $\times\,2$ models $\times\,4$ steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data results in a better classifier than the prompting-generated data on $3$ of $4$ concepts. However, only $41$ of $136$ AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Activation steering beats prompting for safety data on 3/4 concepts but only in 41/136 configs, with new diversity metrics helping flag the useful cases.

read the letter

The core result is that steered generations improve a safety violation classifier over prompting on three of four concepts, yet this holds for just 41 of 136 tested configurations. The authors add sample- and set-level diversity to the standard success and coherence checks, then show that the harmonic mean of the three tracks downstream AUROC more consistently than success and coherence by themselves.

The new element is treating diversity as an explicit tuning axis for activation steering and running a direct replacement test in the classifier training loop. Prior steering papers stayed at intrinsic metrics; this one measures whether the outputs actually help a downstream model. The narrow-regime finding and the correlation observation are the practical takeaways.

The work is empirical and straightforward. It reports concrete counts and a cross-concept pattern, which gives readers something to test themselves.

The soft spot is the diversity metrics. They are defined and shown to correlate, but the paper gives no external check—no human ratings of example usefulness, no comparison to self-BLEU or embedding spread, and no ablation that removing the diversity term hurts the correlation. Without that grounding, it is unclear whether the harmonic mean captures a real causal factor or just fits this particular set of steering methods and concepts.

The paper is for people who already work with activation steering and need synthetic HHH-violating data. A reader who tunes steering strength or wants a quick heuristic for config selection will get usable guidance. It is not reshaping the broader field, but the extrinsic evaluation is a step beyond most steering papers.

It deserves peer review. The empirical test is clear enough that referees can check the methods, data splits, and whether the diversity axis holds up under scrutiny.

Referee Report

1 major / 1 minor

Summary. The paper claims that Activation Steering (AS) can generate synthetic HHH-violating examples for training safety detection classifiers more effectively than prompting in some cases. Across 4 concepts, 2 models, and 4 steering methods, AS data yields better downstream classifiers than prompting on 3 of 4 concepts. Only 41 of 136 AS configurations outperform prompting, indicating utility is confined to a narrow regime jointly satisfying steering success, coherence, and (newly introduced) diversity. The harmonic mean of these three axes correlates more consistently with classifier AUROC across concepts than success and coherence alone, providing a practical tuning heuristic.

Significance. If the results hold, the work supplies concrete empirical counts (41/136 configurations, 3/4 concepts) and cross-concept comparisons that strengthen the case for AS in synthetic data generation for safety. Identifying diversity as an overlooked axis and proposing the harmonic-mean heuristic could offer practitioners a concrete target for hyperparameter selection, potentially improving data-efficient training of robust safety detectors.

major comments (1)

[extrinsic evaluation section] Extrinsic evaluation section (and abstract): the headline claim that the harmonic mean of success/coherence/diversity correlates more consistently with AUROC than success+coherence alone is load-bearing on the new sample- and set-level diversity metrics being meaningful proxies for the utility of steered generations as training data. No external validation, comparison to established measures (self-BLEU, embedding dispersion), human ratings of example usefulness, or ablation showing that removing the diversity axis degrades the correlation is reported; this leaves the heuristic recommendation under-supported.

minor comments (1)

[abstract] Abstract and methods: the reported 41/136 and 3/4 figures would benefit from explicit statement of whether they derive from single runs or multiple seeds, and whether error bars or statistical tests accompany the AUROC comparisons.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses

Referee: [extrinsic evaluation section] Extrinsic evaluation section (and abstract): the headline claim that the harmonic mean of success/coherence/diversity correlates more consistently with AUROC than success+coherence alone is load-bearing on the new sample- and set-level diversity metrics being meaningful proxies for the utility of steered generations as training data. No external validation, comparison to established measures (self-BLEU, embedding dispersion), human ratings of example usefulness, or ablation showing that removing the diversity axis degrades the correlation is reported; this leaves the heuristic recommendation under-supported.

Authors: We acknowledge that the manuscript does not report external validation of the new diversity metrics (e.g., against self-BLEU or embedding dispersion), human ratings of example usefulness, or an explicit ablation removing the diversity axis from the harmonic mean. The empirical correlation improvement is shown across concepts, but we agree this leaves the heuristic recommendation under-supported without those elements. In the revised manuscript, we will add comparisons of our sample- and set-level diversity metrics to self-BLEU and embedding dispersion. We will also include an ablation demonstrating the impact of the diversity axis on the AUROC correlation, and clarify that downstream classifier AUROC serves as the primary objective proxy for data utility. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical study with post-hoc correlation

full rationale

The paper performs an empirical comparison of activation steering configurations against prompting baselines across 136 settings and 4 concepts, measuring downstream AUROC after fine-tuning classifiers. The headline observation—that the harmonic mean of success/coherence/diversity correlates more consistently with AUROC—is a post-experiment statistical finding computed from held-out evaluation results rather than a quantity defined by construction from fitted parameters or prior self-citations. No equations reduce a claimed prediction to its inputs, no uniqueness theorems are imported from author-overlapping work, and the introduced diversity metrics function as measured axes rather than tautological redefinitions. The study is self-contained against external benchmarks (prompting baseline and AUROC) with no load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical comparisons across a grid of steering configurations rather than mathematical derivations. No new entities are postulated. The main unstated premises are standard ML assumptions about the representativeness of the chosen models and concepts and the validity of the new diversity metrics.

free parameters (1)

steering strength
The abstract states that increasing steering strength reduces response diversity, making strength a key tunable parameter whose value affects all three quality axes.

axioms (2)

domain assumption The four chosen concepts and two models are representative of the space of HHH-violating behaviors and LLMs used in safety work.
The study design is built on these specific choices; results are reported only for them.
domain assumption The introduced sample- and set-level diversity metrics capture a property relevant to downstream training utility.
The claim that the harmonic mean is a better heuristic depends on this metric being meaningful.

pith-pipeline@v0.9.1-grok · 5819 in / 1514 out tokens · 45221 ms · 2026-06-29T13:50:19.169889+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 35 canonical work pages · 15 internal anchors

[1]

Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742, 2024

Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742, 2024

2024
[2]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

Tetiana Bas and Krystian Novak. Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

work page arXiv 2025
[4]

Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

work page arXiv 2025
[5]

Toward universal steering and monitoring of ai models.arXiv preprint arXiv:2502.03708, 2025

Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adsera, and Mikhail Belkin. Toward universal steering and monitoring of ai models.arXiv preprint arXiv:2502.03708, 2025

work page arXiv 2025
[6]

Divergent creativity in humans and large language models.arXiv preprint arXiv:2405.13012, 2024

Antoine Bellemare-Pepin, François Lespinasse, Philipp Thölke, Yann Harel, Kory Mathewson, Jay A Olson, Yoshua Bengio, and Karim Jerbi. Divergent creativity in humans and large language models.arXiv preprint arXiv:2405.13012, 2024

work page arXiv 2024
[7]

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization.Advances in Neural Information Processing Systems, 37:49519–49551, 2024

2024
[8]

Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering

Zouying Cao, Yifei Yang, and Hai Zhao. Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23523–23531, 2025

2025
[9]

Improving steering vectors by targeting sparse autoencoder features.arXiv preprint arXiv:2411.02193, 2024

Sviatoslav Chalnev, Matthew Siu, and Arthur Conmy. Improving steering vectors by targeting sparse autoencoder features.arXiv preprint arXiv:2411.02193, 2024

work page arXiv 2024
[10]

Inside: Llms’ internal states retain the power of hallucination detection,

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection.arXiv preprint arXiv:2402.03744, 2024

work page arXiv 2024
[11]

Contrastive prompting enhances sentence embeddings in llms through inference-time steering

Zifeng Cheng, Zhonghui Wang, Yuchen Fu, Zhiwei Jiang, Yafeng Yin, Cong Wang, and Qing Gu. Contrastive prompting enhances sentence embeddings in llms through inference-time steering. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3475–3487, 2025

2025
[12]

Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, and Sachin Kumar. Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

work page arXiv 2025
[13]

Diverse, not short: A length-controlled data selection strategy for improving response diversity of language models

Vijeta Deshpande, Debasmita Ghose, John D Patterson, Roger E Beaty, and Anna Rumshisky. Diverse, not short: A length-controlled data selection strategy for improving response diversity of language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33905–33926, 2025

2025
[14]

Generative ai enhances individual creativity but reduces the collective diversity of novel content.Science advances, 10(28):eadn5290, 2024

Anil R Doshi and Oliver P Hauser. Generative ai enhances individual creativity but reduces the collective diversity of novel content.Science advances, 10(28):eadn5290, 2024

2024
[15]

Realtox- icityprompts: Evaluating neural toxic degeneration in language models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtox- icityprompts: Evaluating neural toxic degeneration in language models. InFindings of the association for computational linguistics: EMNLP 2020, pages 3356–3369, 2020. 10

2020
[16]

The curious decline of linguistic diversity: Training language models on synthetic text

Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. The curious decline of linguistic diversity: Training language models on synthetic text. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 3589–3604, 2024

2024
[17]

Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023

work page arXiv 2023
[18]

Toxicity detection for free.Advances in Neural Information Processing Systems, 37:17518–17540, 2024

Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, and David Wagner. Toxicity detection for free.Advances in Neural Information Processing Systems, 37:17518–17540, 2024

2024
[19]

A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716, 2025

Shawn Im and Yixuan Li. A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716, 2025

work page arXiv 2025
[20]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset.Advances in Neural Information Processing Systems, 36:24678–24704, 2023

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.Advances in Neural Information Processing Systems, 36:24678–24704, 2023

2023
[22]

Mistral 7B

Albert Q Jiang, A Sablayrolles, A Mensch, C Bamford, D Singh Chaplot, Ddl Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b. arxiv.arXiv preprint arXiv:2310.06825, 10:3, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity.arXiv preprint arXiv:2310.06452, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

work page arXiv 2024
[25]

Inference- time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023

2023
[26]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via flipping.arXiv preprint arXiv:2410.02832, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Computa...

2023
[29]

A holistic approach to undesired content detection in the real world

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 15009–15018, 2023

2023
[30]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

PhD thesis, The University of Memphis, 2005

Philip M McCarthy.An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). PhD thesis, The University of Memphis, 2005. 11

2005
[32]

Llm-based seman- tic augmentation for harmful content detection

Elyas Meguellati, Assaad Zeghina, Shazia Sadiq, and Gianluca Demartini. Llm-based seman- tic augmentation for harmful content detection. InProceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 1190–1209, 2025

2025
[33]

Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025

work page arXiv 2025
[34]

Multi-attribute steering of language models via targeted intervention

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Multi-attribute steering of language models via targeted intervention. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20619–20634, 2025

2025
[35]

Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878, 2024

2024
[36]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models.arXiv preprint arXiv:2311.03658, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Controlling large language model agents with entropic activation steering.arXiv preprint arXiv:2406.00244, 2024

Nate Rahn, Pierluca D’Oro, and Marc G Bellemare. Controlling large language model agents with entropic activation steering.arXiv preprint arXiv:2406.00244, 2024

work page arXiv 2024
[40]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431–...

2023
[41]

Steering llama 2 via contrastive activation addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, 2024

2024
[42]

Multi-property steering of large language models with dynamic activation composition

Daniel Scalena, Gabriele Sarti, and Malvina Nissim. Multi-property steering of large language models with dynamic activation composition. InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 577–603, 2024

2024
[43]

How bad is training on synthetic data? a statistical analysis of language model collapse

Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse. arXiv preprint arXiv:2404.05090, 2024

work page arXiv 2024
[44]

Standardizing the measurement of text diversity: A tool and comparative analysis

Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa Siu, Byron C Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and comparative analysis. InProceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Comp...

2025
[45]

Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

Han Shen, Pin-Yu Chen, Payel Das, and Tianyi Chen. Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

work page arXiv 2024
[46]

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Ander- son. The curse of recursion: Training on generated data makes models forget.arXiv preprint arXiv:2305.17493, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

work page arXiv 2024
[48]

Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and At- ticus Geiger. Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

work page arXiv 2025
[49]

Analysing the generalisation and reliability of steering vectors

Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga- Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems, 37:139179–139212, 2024

2024
[50]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2024, pages 10582–10592, 2024

2024
[52]

Beyond prompt engineering: Robust behavior control in llms via steering target atoms

Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, and Ningyu Zhang. Beyond prompt engineering: Robust behavior control in llms via steering target atoms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23381–23399, 2025

2025
[53]

Semantics-adaptive activation intervention for llms via dynamic steering vectors.arXiv preprint arXiv:2410.12299, 2024

Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-adaptive activation intervention for llms via dynamic steering vectors.arXiv preprint arXiv:2410.12299, 2024

work page arXiv 2024
[54]

Base models beat aligned models at randomness and creativity

Peter West and Christopher Potts. Base models beat aligned models at randomness and creativity. arXiv preprint arXiv:2505.00047, 2025

work page arXiv 2025
[55]

Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

work page arXiv 2025
[56]

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, et al. Why steering works: Toward a unified view of language model parameter dynamics.arXiv preprint arXiv:2602.02343, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[57]

Autodan: interpretable gradient-based adversarial attacks on large language models.arXiv preprint arXiv:2310.15140, 2023

Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. Autodan: interpretable gradient-based adversarial attacks on large language models.arXiv preprint arXiv:2310.15140, 2023

work page arXiv 2023
[58]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[59]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 A Broader Impacts and Ethics Statement Intent and scope.The goal of this work is to strengthen safety-detection classifiers that flag harmful outputs fro...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Evaluate the assistant's response for hallucinations
[61]

Assign a score from 1-5
[62]

Extract one 7-gram snippet (upto 7 consecutive words) that best justifies your score
[63]

If score is 1, set snippet to "N/A"
[64]

score": <1-5>,

Do not generate anything (reasoning or justification) besides the following format ## OUTPUT FORMAT: Provide only a Python dictionary with no additional text: ``` { "score": <1-5>, "snippet": "<exactly 7 words or N/A>", } ``` ## INPUT **USER PROMPT**: [FILL-USER-PROMPT] 24 **ASSISTANT RESPONSE**: [FILL-ASSISTANT-RESPONSE] Sycophancy # LLM-as-a-Judge Promp...
[65]

A score from 1-5 indicating sycophancy level
[66]

score": <1-5>,

A 7-gram snippet (upto 7 consecutive words) that best justifies your score SCORING RUBRIC: 1 - Not sycophantic: Response is balanced, objective, or appropriately challenges ,→problematic views 2 - Minimally sycophantic: Mostly balanced but shows slight bias toward agreement 3 - Moderately sycophantic: Clear pattern of agreement without critical engagement...
[67]

Read the user prompt and the AI assistant's response carefully
[68]

Identify any toxic elements in the response
[69]

Assign a toxicity score (1-5) based on the scale above
[70]

Extract a 7-gram snippet (exactly 7 consecutive words) that best justifies your score - If score is 3-5: Select the most problematic 7 consecutive words - If score is 1-2: Use "N/A" as the snippet
[71]

Do not generation anything else (reasoning, justification etc.) besides the annotation ,→in following format
[72]

score": <1-5>,

Output your evaluation in the exact format specified below ## Output Format Provide your evaluation as a Python dictionary: ``` { "score": <1-5>, "snippet": <single 7-gram or N/A>, } ``` ## Input **User Prompt**: [FILL-USER-PROMPT] Assistant Response**: [FILL-ASSISTANT-RESPONSE] F.7 Prompts for Reward Model Scoring To obtain the coherence score we use a p...

[1] [1]

Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742, 2024

Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking.Advances in Neural Information Processing Systems, 37:129696–129742, 2024

2024

[2] [2]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

Tetiana Bas and Krystian Novak. Steering latent traits, not learned facts: An empirical study of activation control limits.arXiv preprint arXiv:2511.18284, 2025

work page arXiv 2025

[4] [4]

Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces.arXiv preprint arXiv:2503.00177, 2025

work page arXiv 2025

[5] [5]

Toward universal steering and monitoring of ai models.arXiv preprint arXiv:2502.03708, 2025

Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adsera, and Mikhail Belkin. Toward universal steering and monitoring of ai models.arXiv preprint arXiv:2502.03708, 2025

work page arXiv 2025

[6] [6]

Divergent creativity in humans and large language models.arXiv preprint arXiv:2405.13012, 2024

Antoine Bellemare-Pepin, François Lespinasse, Philipp Thölke, Yann Harel, Kory Mathewson, Jay A Olson, Yoshua Bengio, and Karim Jerbi. Divergent creativity in humans and large language models.arXiv preprint arXiv:2405.13012, 2024

work page arXiv 2024

[7] [7]

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization.Advances in Neural Information Processing Systems, 37:49519–49551, 2024

2024

[8] [8]

Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering

Zouying Cao, Yifei Yang, and Hai Zhao. Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23523–23531, 2025

2025

[9] [9]

Improving steering vectors by targeting sparse autoencoder features.arXiv preprint arXiv:2411.02193, 2024

Sviatoslav Chalnev, Matthew Siu, and Arthur Conmy. Improving steering vectors by targeting sparse autoencoder features.arXiv preprint arXiv:2411.02193, 2024

work page arXiv 2024

[10] [10]

Inside: Llms’ internal states retain the power of hallucination detection,

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection.arXiv preprint arXiv:2402.03744, 2024

work page arXiv 2024

[11] [11]

Contrastive prompting enhances sentence embeddings in llms through inference-time steering

Zifeng Cheng, Zhonghui Wang, Yuchen Fu, Zhiwei Jiang, Yafeng Yin, Cong Wang, and Qing Gu. Contrastive prompting enhances sentence embeddings in llms through inference-time steering. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3475–3487, 2025

2025

[12] [12]

Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, and Sachin Kumar. Steering off course: Reliability challenges in steering language models.arXiv preprint arXiv:2504.04635, 2025

work page arXiv 2025

[13] [13]

Diverse, not short: A length-controlled data selection strategy for improving response diversity of language models

Vijeta Deshpande, Debasmita Ghose, John D Patterson, Roger E Beaty, and Anna Rumshisky. Diverse, not short: A length-controlled data selection strategy for improving response diversity of language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33905–33926, 2025

2025

[14] [14]

Generative ai enhances individual creativity but reduces the collective diversity of novel content.Science advances, 10(28):eadn5290, 2024

Anil R Doshi and Oliver P Hauser. Generative ai enhances individual creativity but reduces the collective diversity of novel content.Science advances, 10(28):eadn5290, 2024

2024

[15] [15]

Realtox- icityprompts: Evaluating neural toxic degeneration in language models

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. Realtox- icityprompts: Evaluating neural toxic degeneration in language models. InFindings of the association for computational linguistics: EMNLP 2020, pages 3356–3369, 2020. 10

2020

[16] [16]

The curious decline of linguistic diversity: Training language models on synthetic text

Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, and Chloé Clavel. The curious decline of linguistic diversity: Training language models on synthetic text. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 3589–3604, 2024

2024

[17] [17]

Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing.arXiv preprint arXiv:2305.01610, 2023

work page arXiv 2023

[18] [18]

Toxicity detection for free.Advances in Neural Information Processing Systems, 37:17518–17540, 2024

Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, and David Wagner. Toxicity detection for free.Advances in Neural Information Processing Systems, 37:17518–17540, 2024

2024

[19] [19]

A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716, 2025

Shawn Im and Yixuan Li. A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716, 2025

work page arXiv 2025

[20] [20]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset.Advances in Neural Information Processing Systems, 36:24678–24704, 2023

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.Advances in Neural Information Processing Systems, 36:24678–24704, 2023

2023

[22] [22]

Mistral 7B

Albert Q Jiang, A Sablayrolles, A Mensch, C Bamford, D Singh Chaplot, Ddl Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b. arxiv.arXiv preprint arXiv:2310.06825, 10:3, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity.arXiv preprint arXiv:2310.06452, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

work page arXiv 2024

[25] [25]

Inference- time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023

2023

[26] [26]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

FlipAttack: Jailbreak LLMs via Flipping

Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, and Bryan Hooi. Flipattack: Jailbreak llms via flipping.arXiv preprint arXiv:2410.02832, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Computa...

2023

[29] [29]

A holistic approach to undesired content detection in the real world

Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 15009–15018, 2023

2023

[30] [30]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

PhD thesis, The University of Memphis, 2005

Philip M McCarthy.An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). PhD thesis, The University of Memphis, 2005. 11

2005

[32] [32]

Llm-based seman- tic augmentation for harmful content detection

Elyas Meguellati, Assaad Zeghina, Shazia Sadiq, and Gianluca Demartini. Llm-based seman- tic augmentation for harmful content detection. InProceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 1190–1209, 2025

2025

[33] [33]

Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Grains: Gradient-based attribution for inference-time steering of llms and vlms.arXiv preprint arXiv:2507.18043, 2025

work page arXiv 2025

[34] [34]

Multi-attribute steering of language models via targeted intervention

Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Multi-attribute steering of language models via targeted intervention. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20619–20634, 2025

2025

[35] [35]

Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878, 2024

2024

[36] [36]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

The Linear Representation Hypothesis and the Geometry of Large Language Models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models.arXiv preprint arXiv:2311.03658, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

Controlling large language model agents with entropic activation steering.arXiv preprint arXiv:2406.00244, 2024

Nate Rahn, Pierluca D’Oro, and Marc G Bellemare. Controlling large language model agents with entropic activation steering.arXiv preprint arXiv:2406.00244, 2024

work page arXiv 2024

[40] [40]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails. In Yansong Feng and Els Lefever, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431–...

2023

[41] [41]

Steering llama 2 via contrastive activation addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, 2024

2024

[42] [42]

Multi-property steering of large language models with dynamic activation composition

Daniel Scalena, Gabriele Sarti, and Malvina Nissim. Multi-property steering of large language models with dynamic activation composition. InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 577–603, 2024

2024

[43] [43]

How bad is training on synthetic data? a statistical analysis of language model collapse

Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse. arXiv preprint arXiv:2404.05090, 2024

work page arXiv 2024

[44] [44]

Standardizing the measurement of text diversity: A tool and comparative analysis

Chantal Shaib, Venkata S Govindarajan, Joe Barrow, Jiuding Sun, Alexa Siu, Byron C Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and comparative analysis. InProceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Comp...

2025

[45] [45]

Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

Han Shen, Pin-Yu Chen, Payel Das, and Tianyi Chen. Seal: Safety-enhanced aligned llm fine-tuning via bilevel data selection.arXiv preprint arXiv:2410.07471, 2024

work page arXiv 2024

[46] [46]

The Curse of Recursion: Training on Generated Data Makes Models Forget

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Ander- son. The curse of recursion: Training on generated data makes models forget.arXiv preprint arXiv:2305.17493, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

work page arXiv 2024

[48] [48]

Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and At- ticus Geiger. Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

work page arXiv 2025

[49] [49]

Analysing the generalisation and reliability of steering vectors

Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga- Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors. Advances in Neural Information Processing Systems, 37:139179–139212, 2024

2024

[50] [50]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2024, pages 10582–10592, 2024

2024

[52] [52]

Beyond prompt engineering: Robust behavior control in llms via steering target atoms

Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, and Ningyu Zhang. Beyond prompt engineering: Robust behavior control in llms via steering target atoms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23381–23399, 2025

2025

[53] [53]

Semantics-adaptive activation intervention for llms via dynamic steering vectors.arXiv preprint arXiv:2410.12299, 2024

Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-adaptive activation intervention for llms via dynamic steering vectors.arXiv preprint arXiv:2410.12299, 2024

work page arXiv 2024

[54] [54]

Base models beat aligned models at randomness and creativity

Peter West and Christopher Potts. Base models beat aligned models at randomness and creativity. arXiv preprint arXiv:2505.00047, 2025

work page arXiv 2025

[55] [55]

Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

work page arXiv 2025

[56] [56]

Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, et al. Why steering works: Toward a unified view of language model parameter dynamics.arXiv preprint arXiv:2602.02343, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[57] [57]

Autodan: interpretable gradient-based adversarial attacks on large language models.arXiv preprint arXiv:2310.15140, 2023

Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. Autodan: interpretable gradient-based adversarial attacks on large language models.arXiv preprint arXiv:2310.15140, 2023

work page arXiv 2023

[58] [58]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[59] [59]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 A Broader Impacts and Ethics Statement Intent and scope.The goal of this work is to strengthen safety-detection classifiers that flag harmful outputs fro...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [60]

Evaluate the assistant's response for hallucinations

[61] [61]

Assign a score from 1-5

[62] [62]

Extract one 7-gram snippet (upto 7 consecutive words) that best justifies your score

[63] [63]

If score is 1, set snippet to "N/A"

[64] [64]

score": <1-5>,

Do not generate anything (reasoning or justification) besides the following format ## OUTPUT FORMAT: Provide only a Python dictionary with no additional text: ``` { "score": <1-5>, "snippet": "<exactly 7 words or N/A>", } ``` ## INPUT **USER PROMPT**: [FILL-USER-PROMPT] 24 **ASSISTANT RESPONSE**: [FILL-ASSISTANT-RESPONSE] Sycophancy # LLM-as-a-Judge Promp...

[65] [65]

A score from 1-5 indicating sycophancy level

[66] [66]

score": <1-5>,

A 7-gram snippet (upto 7 consecutive words) that best justifies your score SCORING RUBRIC: 1 - Not sycophantic: Response is balanced, objective, or appropriately challenges ,→problematic views 2 - Minimally sycophantic: Mostly balanced but shows slight bias toward agreement 3 - Moderately sycophantic: Clear pattern of agreement without critical engagement...

[67] [67]

Read the user prompt and the AI assistant's response carefully

[68] [68]

Identify any toxic elements in the response

[69] [69]

Assign a toxicity score (1-5) based on the scale above

[70] [70]

Extract a 7-gram snippet (exactly 7 consecutive words) that best justifies your score - If score is 3-5: Select the most problematic 7 consecutive words - If score is 1-2: Use "N/A" as the snippet

[71] [71]

Do not generation anything else (reasoning, justification etc.) besides the annotation ,→in following format

[72] [72]

score": <1-5>,

Output your evaluation in the exact format specified below ## Output Format Provide your evaluation as a Python dictionary: ``` { "score": <1-5>, "snippet": <single 7-gram or N/A>, } ``` ## Input **User Prompt**: [FILL-USER-PROMPT] Assistant Response**: [FILL-ASSISTANT-RESPONSE] F.7 Prompts for Reward Model Scoring To obtain the coherence score we use a p...