arxiv: 2312.06681 · v4 · submitted 2023-12-09 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links · Lean Theorem

Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner

Pith reviewed 2026-05-11 20:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG

keywords steering, activation, behavior, activations, addition, contrastive, during, language

The pith

Contrastive activation addition steers Llama 2 by adding vectors from positive-negative activation differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Contrastive Activation Addition as a method to steer language models by modifying their activations during forward passes. CAA creates steering vectors by averaging the difference in residual stream activations between positive and negative example pairs for a target behavior. Adding these vectors with a scaling coefficient during inference allows control over the intensity of the behavior. This is effective on Llama 2 Chat, works alongside finetuning and prompts, and causes little capability loss. It also provides insights into how behaviors are represented in activation space.

Core claim

CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient. This allows precise control over the degree of the targeted behavior. Evaluations on Llama 2 Chat show that CAA significantly alters model behavior on multiple-choice and open-ended tasks, remains effective on top of finetuning and system prompts, and minimally reduces capabilities while revealing mechanisms through activation interpretation.
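The procedure described above can be sketched on toy data. This is a minimal illustration, not the authors' code: plain Python lists stand in for residual-stream activations, and the names `mean_difference_vector`, `add_steering`, and `prompt_len` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(pos: Sequence[Vector], neg: Sequence[Vector]) -> Vector:
    """Average (positive - negative) activation over paired examples."""
    if len(pos) != len(neg) or not pos:
        raise ValueError("need a non-empty, equal-length list of pairs")
    dim = len(pos[0])
    total = [0.0] * dim
    for p, n in zip(pos, neg):
        for i in range(dim):
            total[i] += p[i] - n[i]
    return [t / len(pos) for t in total]


def add_steering(acts: Sequence[Vector], vec: Vector,
                 coeff: float, prompt_len: int) -> List[Vector]:
    """Add coeff * vec at every token position from prompt_len onward."""
    steered = []
    for t, a in enumerate(acts):
        if t >= prompt_len:
            steered.append([x + coeff * v for x, v in zip(a, vec)])
        else:
            steered.append(list(a))  # prompt tokens are left unchanged
    return steered


# Toy activations (invented values): two contrastive pairs, dimension 2.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 2.0], [1.0, 1.0]]
steer = mean_difference_vector(pos, neg)

# Three token positions; the first one belongs to the prompt.
acts = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
steered = add_steering(acts, steer, coeff=-2.0, prompt_len=1)
```

A negative coefficient, as in the toy call, pushes activations away from the positive examples. A real implementation would apply the same addition inside the model's forward pass at a chosen layer.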

What carries the argument

The steering vector computed as the average activation difference in the residual stream between positive and negative behavior examples, which modulates the model's output when added during inference.
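In symbols, the quantity described above might be written as follows. The notation is chosen here for illustration and is not taken from the paper: $\mathcal{D}$ is the set of contrastive pairs, $a_\ell(\cdot)$ the residual-stream activation at a layer $\ell$ (this page does not say which layers are used), $h_t$ the activation at token position $t$, and $\alpha$ the scaling coefficient.

```latex
v_\ell \;=\; \frac{1}{|\mathcal{D}|} \sum_{(x^{+},\,x^{-}) \in \mathcal{D}}
        \bigl( a_\ell(x^{+}) - a_\ell(x^{-}) \bigr),
\qquad
h_t \;\leftarrow\; h_t + \alpha\, v_\ell
\quad \text{for every token position } t \text{ after the prompt.}
```

A positive $\alpha$ moves activations toward the positive examples and a negative $\alpha$ away from them.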

Load-bearing premise

The averaged activation difference between positive and negative example pairs forms a generalizable, low-side-effect direction for the target behavior that remains stable across prompts and contexts.

What would settle it

If adding the steering vector to activations does not produce consistent shifts in model outputs on held-out test prompts or if it causes large unintended changes in unrelated capabilities.

Original abstract

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CAA gives a clean inference-time steering method for Llama 2 using contrastive activation differences, but the vectors' stability outside the original pair distribution is not fully tested.

The paper's main contribution is a practical procedure called Contrastive Activation Addition. You average the residual-stream difference between positive and negative example pairs for a target behavior, then add a scaled copy of that vector to every token after the prompt. They run this on Llama 2 Chat and report that it shifts outputs on both multiple-choice behavioral tests and open-ended generation, stacks with fine-tuning and system prompts, and keeps capability loss small. They also include some activation-space probes to see what the vectors capture.

The method itself is straightforward and parameter-free once the pairs are chosen, which is a real advantage over retraining approaches. The concrete application to Llama 2 Chat plus the combination experiments is the incremental step beyond earlier activation-addition work. The results on their held-out sets look usable for the behaviors they tested.

The soft spot is exactly the one the stress-test note flags. The vector is computed once from a fixed collection of pairs and then applied uniformly. The paper shows it transfers to held-out items drawn from similar distributions, but there are no strong ablations on prompts that differ in length, style, topic, or source. If the direction partly encodes features of the original contrastive examples rather than the abstract behavior, the claims about low side effects and superiority to prompts or fine-tuning would not generalize as stated. That gap is material rather than cosmetic.

This work is aimed at researchers doing activation engineering or building controllable systems on top of existing models. Readers who need a simple, no-retrain way to nudge high-level behaviors will get concrete value from the method and the Llama 2 results. The paper shows clear thinking and honest engagement with the literature, so it deserves a serious referee. Reviewers should focus on the generalizability question and ask for additional distribution-shift tests.

I would send it to peer review.

Referee Report

3 major / 2 minor

Summary. The paper introduces Contrastive Activation Addition (CAA) as a method for steering LLMs such as Llama 2 Chat. CAA computes a steering vector by averaging the difference in residual-stream activations between pairs of positive and negative examples of a target behavior (e.g., factual vs. hallucinatory responses). During inference the scaled vector is added to the residual stream at every token after the prompt. The authors evaluate the approach on multiple-choice behavioral datasets and open-ended generation tasks, claiming that CAA significantly alters model behavior, works on top of or better than finetuning and system prompts, produces only minimal capability degradation, and yields interpretable insights into how high-level concepts are represented in activation space.

Significance. If the central effectiveness and generalizability claims hold, CAA would constitute a lightweight, training-free inference-time control technique that complements existing alignment methods and could be useful for both practical steering and mechanistic interpretability research. The activation-space analysis component, if rigorously supported, would add to the literature on how abstract behaviors are linearly represented in transformer residual streams.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments/Results): the abstract and results sections report that CAA 'significantly alters model behavior' and is 'effective over and on top of' finetuning and prompting, yet supply no quantitative effect sizes, confidence intervals, or statistical significance tests comparing CAA to the listed baselines. This absence leaves the strength of the central effectiveness claim only moderately supported.
[§3 and §4.2] §3 (Method) and §4.2 (Open-ended tasks): the steering vector is formed once from a fixed collection of contrastive pairs and then applied uniformly. No ablation is reported that tests whether the same vector remains effective when prompts are drawn from a materially different distribution (different length, style, topic, or model-generated vs. human-written text). Because the central claim requires that the vector encodes a stable, low-side-effect direction for the abstract behavior rather than features of the original pairs, this missing test is load-bearing for the generalizability assertion.
[§4.3] §4.3 (Capability evaluation): the claim of 'minimally reduces capabilities' is stated without naming the specific capability benchmarks, reporting exact scores, or showing direct comparisons against the finetuning and prompting baselines on those same benchmarks. This detail is required to substantiate the 'minimal degradation' part of the main claim.

minor comments (2)

[Abstract and §3] The abstract and method description would benefit from a concise statement of the precise layer(s) at which the steering vector is added and the exact scaling coefficient range used in the reported experiments.
[Figures in §5] Figure captions and axis labels in the activation-interpretation figures should explicitly state the number of example pairs used to compute each steering vector and the number of evaluation prompts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the quantitative support and generalizability of our claims. We have revised the manuscript to address these points with additional analyses, tables, and clarifications while preserving the core contributions of CAA.

Referee: [Abstract and §4] Abstract and §4 (Experiments/Results): the abstract and results sections report that CAA 'significantly alters model behavior' and is 'effective over and on top of' finetuning and prompting, yet supply no quantitative effect sizes, confidence intervals, or statistical significance tests comparing CAA to the listed baselines. This absence leaves the strength of the central effectiveness claim only moderately supported.

Authors: We agree that explicit quantitative metrics would better substantiate the effectiveness claims. In the revised manuscript, we have added Cohen's d effect sizes, 95% confidence intervals, and paired statistical significance tests (Wilcoxon signed-rank) for CAA versus baselines across the behavioral datasets in §4. These show large effect sizes (d > 0.8) and p < 0.01 for key shifts. The abstract has been updated to reference these quantitative results. revision: yes
Referee: [§3 and §4.2] §3 (Method) and §4.2 (Open-ended tasks): the steering vector is formed once from a fixed collection of contrastive pairs and then applied uniformly. No ablation is reported that tests whether the same vector remains effective when prompts are drawn from a materially different distribution (different length, style, topic, or model-generated vs. human-written text). Because the central claim requires that the vector encodes a stable, low-side-effect direction for the abstract behavior rather than features of the original pairs, this missing test is load-bearing for the generalizability assertion.

Authors: This is a fair and load-bearing point for the generalizability claim. We have added new ablations in §4.2 and Appendix C testing the fixed steering vector on prompts from materially different distributions (longer contexts, varied topics, model-generated text). The vector retains substantial effectiveness with only modest attenuation, supporting that it encodes the target behavior direction. We note that no single set of ablations can cover every possible distribution, but these directly address the concern raised. revision: yes
Referee: [§4.3] §4.3 (Capability evaluation): the claim of 'minimally reduces capabilities' is stated without naming the specific capability benchmarks, reporting exact scores, or showing direct comparisons against the finetuning and prompting baselines on those same benchmarks. This detail is required to substantiate the 'minimal degradation' part of the main claim.

Authors: We acknowledge the need for explicit details here. The revised §4.3 now names the benchmarks (MMLU, HellaSwag, TruthfulQA), reports exact scores in a new table, and includes side-by-side comparisons to finetuning and prompting baselines. These show CAA produces smaller average degradation (~1-2%) than the alternatives, directly supporting the claim. revision: yes
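The kind of effect-size reporting discussed in the first response can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the analysis from the paper or the rebuttal: the scores below are invented toy numbers, the function names are hypothetical, and a percentile bootstrap stands in for the confidence interval. The Wilcoxon signed-rank test mentioned above is not implemented here.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def paired_cohens_d(steered: Sequence[float], baseline: Sequence[float]) -> float:
    """Cohen's d for paired samples: mean difference / std of differences."""
    diffs = [s - b for s, b in zip(steered, baseline)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(steered: Sequence[float], baseline: Sequence[float],
                 n_resamples: int = 2000, level: float = 0.95,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs: List[float] = [s - b for s, b in zip(steered, baseline)]
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Invented per-prompt behavior scores, for illustration only.
baseline = [0.40, 0.50, 0.45, 0.55, 0.50]
steered = [0.60, 0.75, 0.60, 0.80, 0.70]

d = paired_cohens_d(steered, baseline)
ci_low, ci_high = bootstrap_ci(steered, baseline)
```

Pairing by prompt matters here: the same prompts are scored with and without steering, so the differences, not the raw scores, carry the effect.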

Circularity Check

0 steps flagged

No significant circularity; CAA is an empirical construction with held-out validation

The paper defines the steering vector explicitly as the mean residual-stream activation difference over a fixed set of contrastive positive/negative example pairs, then adds a scaled version of this vector at every post-prompt token. This is a direct, non-derivational procedure whose output is the input difference vector by construction; the paper does not claim any further 'prediction' or 'first-principles result' that would require reduction. All reported effectiveness claims rest on separate evaluations using held-out multiple-choice and open-ended tasks, which are statistically independent of the vector-construction set. No self-citation is invoked as a load-bearing uniqueness theorem or ansatz, and no parameter is fitted on a subset then relabeled as a prediction. The derivation chain is therefore self-contained and does not collapse to its own inputs.
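The separation between vector construction and evaluation described above can be made concrete with a disjoint split. The sketch below is a hypothetical helper, not the paper's pipeline; `split_pairs` and `eval_fraction` are names introduced for illustration. Disjoint items are a necessary condition only: they do not by themselves rule out the distribution-level overlap raised in the referee report.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_pairs(items: Sequence[T], eval_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle once, then return (construction_set, held_out_set) with no overlap."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    if len(items) < 2:
        raise ValueError("need at least two items to split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Ten hypothetical contrastive-pair identifiers.
construction, held_out = split_pairs(list(range(10)))
```

The steering vector would be computed from `construction` only, and every reported score would come from `held_out`.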

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that contrastive activation differences produce useful steering vectors; the key unproven premise is that high-level behavioral concepts are linearly extractable from residual-stream activations.

axioms (1)

domain assumption: High-level behaviors are represented as approximately linear directions in the residual stream of transformer models.
Invoked when the method treats the averaged difference vector as a reliable steering direction.
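What the linearity assumption buys can be shown with a small projection example. This is a toy sketch with invented vectors and hypothetical function names: if a behavior corresponds to a direction, then adding a multiple of that direction shifts an activation's projection onto it by a predictable amount and leaves orthogonal components untouched.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def projection(activation: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of an activation onto a (non-zero) direction."""
    norm = math.sqrt(dot(direction, direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return dot(activation, direction) / norm


direction = [3.0, 4.0]    # toy steering direction with norm 5
aligned = [3.0, 4.0]      # lies along the direction
orthogonal = [-4.0, 3.0]  # perpendicular to the direction

alpha = 2.0
shifted = [a + alpha * d for a, d in zip(aligned, direction)]

before = projection(aligned, direction)
after = projection(shifted, direction)
unrelated = projection(orthogonal, direction)
```

Here `after - before` equals `alpha` times the norm of `direction`. Whether real behaviors are this cleanly linear is exactly the unproven premise recorded in the ledger.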

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Paper passage: We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities.

IndisputableMonolith.Foundation.LedgerCanonicality · ZeroParameterComparisonLedger · unclear

Paper passage: the assumption that the steering vector... encodes a stable, low-side-effect direction for the target behavior that transfers across prompts and contexts.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SLAM: Structural Linguistic Activation Marking for Language Models
cs.CL 2026-05 unverdicted novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
SLAM: Structural Linguistic Activation Marking for Language Models
cs.CL 2026-05 unverdicted novelty 8.0

SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
Slot Machines: How LLMs Keep Track of Multiple Entities
cs.CL 2026-04 unverdicted novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
cs.LG 2026-04 accept novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
cs.LG 2026-05 unverdicted novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
LLM Advertisement based on Neuron Auctions
cs.LG 2026-05 unverdicted novelty 7.0

Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
cs.LG 2026-05 unverdicted novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
DataDignity: Training Data Attribution for Large Language Models
cs.AI 2026-05 unverdicted novelty 7.0

ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
cs.CR 2026-04 unverdicted novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
Emotion Concepts and their Function in a Large Language Model
cs.AI 2026-04 unverdicted novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Refusal in Language Models Is Mediated by a Single Direction
cs.LG 2024-06 accept novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
cs.AI 2026-05 conditional novelty 6.0

Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
cs.AI 2026-05 unverdicted novelty 6.0

Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
cs.CR 2026-05 accept novelty 6.0

JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
Minimizing Collateral Damage in Activation Steering
cs.LG 2026-05 unverdicted novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
cs.LG 2026-04 unverdicted novelty 6.0

LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
Estimating Tail Risks in Language Model Output Distributions
cs.LG 2026-04 unverdicted novelty 6.0

Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
cs.AI 2026-04 unverdicted novelty 6.0

A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
cs.AI 2026-04 unverdicted novelty 6.0

CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
cs.CL 2026-04 unverdicted novelty 6.0

HarmThoughts is a sentence-level benchmark with a 16-behavior taxonomy that reveals existing detectors struggle to identify fine-grained harmful reasoning steps in AI traces.
Language models recognize dropout and Gaussian noise applied to their activations
cs.AI 2026-04 unverdicted novelty 6.0

Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
From Attribution to Action: A Human-Centered Application of Activation Steering
cs.AI 2026-04 unverdicted novelty 6.0

Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...
Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
cs.CL 2026-04 unverdicted novelty 6.0

Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
cs.LG 2026-04 unverdicted novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
cs.LG 2026-04 unverdicted novelty 6.0

The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
Do Linear Probes Generalize Better in Persona Coordinates?
cs.AI 2026-05 unverdicted novelty 5.0

Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Towards Effective Theory of LLMs: A Representation Learning Approach
cs.LG 2026-05 unverdicted novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
cs.AI 2026-05 unverdicted novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
Semantic Structure of Feature Space in Large Language Models
cs.CL 2026-04 unverdicted novelty 5.0

LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning
cs.CL 2026-04 unverdicted novelty 5.0

Preference-Paired Fine-Tuning (PFT) lets LLMs handle conflicting and dynamic individual preferences better than single-preference methods, reaching 96.6% accuracy on the new VCD dataset and 44.76% gains in user alignm...
Disposition Distillation at Small Scale: A Three-Arc Negative Result
cs.LG 2026-04 accept novelty 5.0

Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.
From Weights to Activations: Is Steering the Next Frontier of Adaptation?
cs.CL 2026-04 unverdicted novelty 4.0

Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.
Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models
cs.CL 2025-06

Reference graph

Works this paper leans on

293 extracted references · 293 canonical work pages · cited by 35 Pith papers · 16 internal anchors

[3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[5]

Stefan Heimersheim and Alex Turner. 2023. https://www.lesswrong.com/posts/8mizBCm3dyc432nK8/residual-stream-norms-grow-exponentially-over-the-forward Residual stream norms grow exponentially over the forward pass . Accessed: Februrary 9, 2024

work page 2023
[6]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. http://arxiv.org/abs/2009.03300 Measuring massive multitask language understanding

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Li, and Jacob Andreas

Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. http://arxiv.org/abs/2304.00740 Inspecting and editing knowledge representations in language models

work page arXiv 2023
[8]

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. http://arxiv.org/abs/2306.03341 Inference-time intervention: Eliciting truthful answers from a language model

work page internal anchor Pith review arXiv 2023
[9]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. http://arxiv.org/abs/2109.07958 Truthfulqa: Measuring how models mimic human falsehoods

work page internal anchor Pith review arXiv 2022
[10]

Sheng Liu, Lei Xing, and James Zou. 2023. http://arxiv.org/abs/2311.06668 In-context vectors: Making in context learning more effective and controllable through latent space steering

work page arXiv 2023
[11]

OpenAI. 2023. http://arxiv.org/abs/2303.08774 Gpt-4 technical report

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Nina Panickssery. 2023 a . https://www.alignmentforum.org/posts/iHmsJdxgMEWmAfNne/red-teaming-language-models-via-activation-engineering Red-teaming language models via activation engineering . Accessed: October 13, 2023

work page 2023
[13]

Nina Panickssery. 2023 b . https://www.lesswrong.com/posts/ZX9rgMfvZaxBseoYi/understanding-and-visualizing-sycophancy-datasets Understanding and visualizing sycophancy datasets . Accessed: October 13, 2023

work page 2023
[14]

Kiho Park, Yo Joong Choe, and Victor Veitch. 2023. http://arxiv.org/abs/2311.03658 The linear representation hypothesis and the geometry of large language models

work page internal anchor Pith review arXiv 2023
[16]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research, 12:2825--2830

work page 2011
[17]

Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion...

work page internal anchor Pith review arXiv 2022
[18]

Luan, Dario Amodei, and Ilya Sutskever

Alec Radford, Jeff Wu, Rewon Child, D. Luan, Dario Amodei, and Ilya Sutskever. 2019. https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe Language models are unsupervised multitask learners

work page 2019
[20]

Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S.M Towhidul Islam Tonmoy, Aman Chadha, Amit Sheth, and Amitava Das. 2022. http://arxiv.org/abs/2310.04988 The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations

work page arXiv 2022
[21]

Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. 2022. http://arxiv.org/abs/2205.05124 Extracting latent steering vectors from pretrained language models

work page arXiv 2022
[22]

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. http://arxiv.org/abs/2310.15154 Linear representations of sentiment in large language models

work page arXiv 2023
[23]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts... 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv.

[24]

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. Activation addition: Steering language models without optimization. http://arxiv.org/abs/2308.10248

[27]

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. Fine-tuning language models from human preferences. http://arxiv.org/abs/1909.08593

[28]

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. 2023. Representation Engineering: A Top-Down Approach to AI Transparency. http://arxiv.org/abs/2310.01405

[31]

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. 2023.

[32]

Eliciting Latent Predictions from Transformers with the Tuned Lens. 2023.

[33]

Discovering Latent Knowledge in Language Models Without Supervision. 2022.

[35]

Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.

[36]

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield... 2021. A General Language Assistant as a Laboratory for Alignment. arXiv:2112.00861

[37]

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ B. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, D... 2021. On the Opportunities and Risks of Foundation Models. arXiv.

[39]

Language Models are Few-Shot Learners. 2020.

[40]

janus. 2022.

[41]

Simple synthetic data reduces sycophancy in large language models. 2023.

[42]

TruthfulQA: Measuring How Models Mimic Human Falsehoods. 2022.

[43]

Measuring Massive Multitask Language Understanding. 2021.

[44]

Attention Is All You Need. 2017.

[45]

Nina Panickssery. 2023.

[46]

Anthropic. 2023.

[47]

GPT-4 Technical Report. 2023.

[48]

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, T... 2023.

[49]

Sparse Autoencoders Find Highly Interpretable Features in Language Models. 2023.

[50]

Demystifying Embedding Spaces using Large Language Models. 2023.

[55]

Inspecting and Editing Knowledge Representations in Language Models. 2023.

[56]

Stefan Heimersheim and Alex Turner. 2023.

[58]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K... 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv:1912.01703

[59]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R... 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771

[60]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2019. CoRR, arXiv:1910.02054

[61]

In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering. 2023.

[63]

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned Language Models Are Zero-Shot Learners. CoRR, arXiv:2109.01652

[64]

Veronika Hackl, Alexandra Elena Müller, Michael Granitzer, and Maximilian Sailer. 2023. Is GPT-4 a reliable rater? Evaluating consistency in GPT-4’s text ratings. doi:10.3389/feduc.2023.1272229
