pith. machine review for the scientific record. sign in

arxiv: 2312.06681 · v4 · submitted 2023-12-09 · 💻 cs.CL · cs.AI· cs.LG

Recognition: 3 theorem links

· Lean Theorem

Steering Llama 2 via Contrastive Activation Addition

Authors on Pith no claims yet

Pith reviewed 2026-05-11 20:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords steeringactivationbehavioractivationsadditioncontrastiveduringlanguage
0
0 comments X

The pith

Contrastive activation addition steers Llama 2 by adding vectors from positive-negative activation differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Contrastive Activation Addition as a method to steer language models by modifying their activations during forward passes. CAA creates steering vectors by averaging the difference in residual stream activations between positive and negative example pairs for a target behavior. Adding these vectors with a scaling coefficient during inference allows control over the intensity of the behavior. This is effective on Llama 2 Chat, works alongside finetuning and prompts, and causes little capability loss. It also provides insights into how behaviors are represented in activation space.

Core claim

CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient. This allows precise control over the degree of the targeted behavior. Evaluations on Llama 2 Chat show that CAA significantly alters model behavior on multiple-choice and open-ended tasks, remains effective on top of finetuning and system prompts, and minimally reduces capabilities while revealing mechanisms through activation interpretation

What carries the argument

The steering vector computed as the average activation difference in the residual stream between positive and negative behavior examples, which modulates the model's output when added during inference.

Load-bearing premise

The averaged activation difference between positive and negative example pairs forms a generalizable, low-side-effect direction for the target behavior that remains stable across prompts and contexts.

What would settle it

If adding the steering vector to activations does not produce consistent shifts in model outputs on held-out test prompts or if it causes large unintended changes in unrelated capabilities.

read the original abstract

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Contrastive Activation Addition (CAA) as a method for steering LLMs such as Llama 2 Chat. CAA computes a steering vector by averaging the difference in residual-stream activations between pairs of positive and negative examples of a target behavior (e.g., factual vs. hallucinatory responses). During inference the scaled vector is added to the residual stream at every token after the prompt. The authors evaluate the approach on multiple-choice behavioral datasets and open-ended generation tasks, claiming that CAA significantly alters model behavior, works on top of or better than finetuning and system prompts, produces only minimal capability degradation, and yields interpretable insights into how high-level concepts are represented in activation space.

Significance. If the central effectiveness and generalizability claims hold, CAA would constitute a lightweight, training-free inference-time control technique that complements existing alignment methods and could be useful for both practical steering and mechanistic interpretability research. The activation-space analysis component, if rigorously supported, would add to the literature on how abstract behaviors are linearly represented in transformer residual streams.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments/Results): the abstract and results sections report that CAA 'significantly alters model behavior' and is 'effective over and on top of' finetuning and prompting, yet supply no quantitative effect sizes, confidence intervals, or statistical significance tests comparing CAA to the listed baselines. This absence leaves the strength of the central effectiveness claim only moderately supported.
  2. [§3 and §4.2] §3 (Method) and §4.2 (Open-ended tasks): the steering vector is formed once from a fixed collection of contrastive pairs and then applied uniformly. No ablation is reported that tests whether the same vector remains effective when prompts are drawn from a materially different distribution (different length, style, topic, or model-generated vs. human-written text). Because the central claim requires that the vector encodes a stable, low-side-effect direction for the abstract behavior rather than features of the original pairs, this missing test is load-bearing for the generalizability assertion.
  3. [§4.3] §4.3 (Capability evaluation): the claim of 'minimally reduces capabilities' is stated without naming the specific capability benchmarks, reporting exact scores, or showing direct comparisons against the finetuning and prompting baselines on those same benchmarks. This detail is required to substantiate the 'minimal degradation' part of the main claim.
minor comments (2)
  1. [Abstract and §3] The abstract and method description would benefit from a concise statement of the precise layer(s) at which the steering vector is added and the exact scaling coefficient range used in the reported experiments.
  2. [Figures in §5] Figure captions and axis labels in the activation-interpretation figures should explicitly state the number of example pairs used to compute each steering vector and the number of evaluation prompts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the quantitative support and generalizability of our claims. We have revised the manuscript to address these points with additional analyses, tables, and clarifications while preserving the core contributions of CAA.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments/Results): the abstract and results sections report that CAA 'significantly alters model behavior' and is 'effective over and on top of' finetuning and prompting, yet supply no quantitative effect sizes, confidence intervals, or statistical significance tests comparing CAA to the listed baselines. This absence leaves the strength of the central effectiveness claim only moderately supported.

    Authors: We agree that explicit quantitative metrics would better substantiate the effectiveness claims. In the revised manuscript, we have added Cohen's d effect sizes, 95% confidence intervals, and paired statistical significance tests (Wilcoxon signed-rank) for CAA versus baselines across the behavioral datasets in §4. These show large effect sizes (d > 0.8) and p < 0.01 for key shifts. The abstract has been updated to reference these quantitative results. revision: yes

  2. Referee: [§3 and §4.2] §3 (Method) and §4.2 (Open-ended tasks): the steering vector is formed once from a fixed collection of contrastive pairs and then applied uniformly. No ablation is reported that tests whether the same vector remains effective when prompts are drawn from a materially different distribution (different length, style, topic, or model-generated vs. human-written text). Because the central claim requires that the vector encodes a stable, low-side-effect direction for the abstract behavior rather than features of the original pairs, this missing test is load-bearing for the generalizability assertion.

    Authors: This is a fair and load-bearing point for the generalizability claim. We have added new ablations in §4.2 and Appendix C testing the fixed steering vector on prompts from materially different distributions (longer contexts, varied topics, model-generated text). The vector retains substantial effectiveness with only modest attenuation, supporting that it encodes the target behavior direction. We note that no single set of ablations can cover every possible distribution, but these directly address the concern raised. revision: yes

  3. Referee: [§4.3] §4.3 (Capability evaluation): the claim of 'minimally reduces capabilities' is stated without naming the specific capability benchmarks, reporting exact scores, or showing direct comparisons against the finetuning and prompting baselines on those same benchmarks. This detail is required to substantiate the 'minimal degradation' part of the main claim.

    Authors: We acknowledge the need for explicit details here. The revised §4.3 now names the benchmarks (MMLU, HellaSwag, TruthfulQA), reports exact scores in a new table, and includes side-by-side comparisons to finetuning and prompting baselines. These show CAA produces smaller average degradation (~1-2%) than the alternatives, directly supporting the claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; CAA is an empirical construction with held-out validation

full rationale

The paper defines the steering vector explicitly as the mean residual-stream activation difference over a fixed set of contrastive positive/negative example pairs, then adds a scaled version of this vector at every post-prompt token. This is a direct, non-derivational procedure whose output is the input difference vector by construction; the paper does not claim any further 'prediction' or 'first-principles result' that would require reduction. All reported effectiveness claims rest on separate evaluations using held-out multiple-choice and open-ended tasks, which are statistically independent of the vector-construction set. No self-citation is invoked as a load-bearing uniqueness theorem or ansatz, and no parameter is fitted on a subset then relabeled as a prediction. The derivation chain is therefore self-contained and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that contrastive activation differences produce useful steering vectors; the key unproven premise is that high-level behavioral concepts are linearly extractable from residual-stream activations.

axioms (1)
  • domain assumption High-level behaviors are represented as approximately linear directions in the residual stream of transformer models.
    Invoked when the method treats the averaged difference vector as a reliable steering direction.

pith-pipeline@v0.9.0 · 5475 in / 1168 out tokens · 48159 ms · 2026-05-11T20:30:52.577351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  2. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  3. Slot Machines: How LLMs Keep Track of Multiple Entities

    cs.CL 2026-04 unverdicted novelty 8.0

    LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

  4. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  5. SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

    cs.LG 2026-05 unverdicted novelty 7.0

    SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

  6. LLM Advertisement based on Neuron Auctions

    cs.LG 2026-05 unverdicted novelty 7.0

    Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.

  7. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  8. DataDignity: Training Data Attribution for Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.

  9. Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

    cs.CR 2026-04 unverdicted novelty 7.0

    HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

  10. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  13. The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

    cs.AI 2026-05 conditional novelty 6.0

    Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.

  14. Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

    cs.AI 2026-05 unverdicted novelty 6.0

    Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.

  15. Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

    cs.CR 2026-05 accept novelty 6.0

    JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...

  16. Minimizing Collateral Damage in Activation Steering

    cs.LG 2026-05 unverdicted novelty 6.0

    Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

  17. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  18. Estimating Tail Risks in Language Model Output Distributions

    cs.LG 2026-04 unverdicted novelty 6.0

    Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.

  19. Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

    cs.AI 2026-04 unverdicted novelty 6.0

    A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

  20. CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

    cs.AI 2026-04 unverdicted novelty 6.0

    CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.

  21. When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains

    cs.CL 2026-04 unverdicted novelty 6.0

    HarmThoughts is a sentence-level benchmark with a 16-behavior taxonomy that reveals existing detectors struggle to identify fine-grained harmful reasoning steps in AI traces.

  22. Language models recognize dropout and Gaussian noise applied to their activations

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.

  23. FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.

  24. From Attribution to Action: A Human-Centered Application of Activation Steering

    cs.AI 2026-04 unverdicted novelty 6.0

    Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...

  25. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

    cs.CL 2026-04 unverdicted novelty 6.0

    Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...

  26. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  27. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  28. Do Linear Probes Generalize Better in Persona Coordinates?

    cs.AI 2026-05 unverdicted novelty 5.0

    Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.

  29. Towards Effective Theory of LLMs: A Representation Learning Approach

    cs.LG 2026-05 unverdicted novelty 5.0

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  30. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  31. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  32. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

  33. Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning

    cs.CL 2026-04 unverdicted novelty 5.0

    Preference-Paired Fine-Tuning (PFT) lets LLMs handle conflicting and dynamic individual preferences better than single-preference methods, reaching 96.6% accuracy on the new VCD dataset and 44.76% gains in user alignm...

  34. Disposition Distillation at Small Scale: A Three-Arc Negative Result

    cs.LG 2026-04 accept novelty 5.0

    Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.

  35. From Weights to Activations: Is Steering the Next Frontier of Adaptation?

    cs.CL 2026-04 unverdicted novelty 4.0

    Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.

  36. Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

    cs.CL 2025-06

Reference graph

Works this paper leans on

293 extracted references · 293 canonical work pages · cited by 35 Pith papers · 16 internal anchors

  1. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  2. [5]

    Stefan Heimersheim and Alex Turner. 2023. https://www.lesswrong.com/posts/8mizBCm3dyc432nK8/residual-stream-norms-grow-exponentially-over-the-forward Residual stream norms grow exponentially over the forward pass . Accessed: Februrary 9, 2024

  3. [6]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. http://arxiv.org/abs/2009.03300 Measuring massive multitask language understanding

  4. [7]

    Li, and Jacob Andreas

    Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. http://arxiv.org/abs/2304.00740 Inspecting and editing knowledge representations in language models

  5. [8]

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. http://arxiv.org/abs/2306.03341 Inference-time intervention: Eliciting truthful answers from a language model

  6. [9]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. http://arxiv.org/abs/2109.07958 Truthfulqa: Measuring how models mimic human falsehoods

  7. [10]

    Sheng Liu, Lei Xing, and James Zou. 2023. http://arxiv.org/abs/2311.06668 In-context vectors: Making in context learning more effective and controllable through latent space steering

  8. [11]

    OpenAI. 2023. http://arxiv.org/abs/2303.08774 Gpt-4 technical report

  9. [12]

    Nina Panickssery. 2023 a . https://www.alignmentforum.org/posts/iHmsJdxgMEWmAfNne/red-teaming-language-models-via-activation-engineering Red-teaming language models via activation engineering . Accessed: October 13, 2023

  10. [13]

    Nina Panickssery. 2023 b . https://www.lesswrong.com/posts/ZX9rgMfvZaxBseoYi/understanding-and-visualizing-sycophancy-datasets Understanding and visualizing sycophancy datasets . Accessed: October 13, 2023

  11. [14]

    Kiho Park, Yo Joong Choe, and Victor Veitch. 2023. http://arxiv.org/abs/2311.03658 The linear representation hypothesis and the geometry of large language models

  12. [16]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research, 12:2825--2830

  13. [17]

    Discovering Language Model Behaviors with Model-Written Evaluations

    Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion...

  14. [18]

    Luan, Dario Amodei, and Ilya Sutskever

    Alec Radford, Jeff Wu, Rewon Child, D. Luan, Dario Amodei, and Ilya Sutskever. 2019. https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe Language models are unsupervised multitask learners

  15. [20]

    Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S.M Towhidul Islam Tonmoy, Aman Chadha, Amit Sheth, and Amitava Das. 2022. http://arxiv.org/abs/2310.04988 The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations

  16. [21]

    Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. 2022. http://arxiv.org/abs/2205.05124 Extracting latent steering vectors from pretrained language models

  17. [22]

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. http://arxiv.org/abs/2310.15154 Linear representations of sentiment in large language models

  18. [23]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  19. [24]

    Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. http://arxiv.org/abs/2308.10248 Activation addition: Steering language models without optimization

  20. [27]

    Fine-Tuning Language Models from Human Preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. http://arxiv.org/abs/1909.08593 Fine-tuning language models from human preferences

  21. [28]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. 2023. http://arxiv.org/abs/2310.01405 Representation...

  22. [29]

    2023 , eprint=

    Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=

  23. [30]

    2023 , eprint=

    Activation Addition: Steering Language Models Without Optimization , author=. 2023 , eprint=

  24. [31]

    2023 , eprint=

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=

  25. [32]

    2023 , eprint=

    Eliciting Latent Predictions from Transformers with the Tuned Lens , author=. 2023 , eprint=

  26. [33]

    2022 , eprint=

    Discovering Latent Knowledge in Language Models Without Supervision , author=. 2022 , eprint=

  27. [34]

    2022 , eprint=

    Discovering Language Model Behaviors with Model-Written Evaluations , author=. 2022 , eprint=

  28. [35]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  29. [36]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell and Yuntao Bai and Anna Chen and Dawn Drain and Deep Ganguli and Tom Henighan and Andy Jones and Nicholas Joseph and Benjamin Mann and Nova DasSarma and Nelson Elhage and Zac Hatfield. A General Language Assistant as a Laboratory for Alignment , journal =. 2021 , url =. 2112.00861 , timestamp =

  30. [37]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani and Drew A. Hudson and Ehsan Adeli and Russ B. Altman and Simran Arora and Sydney von Arx and Michael S. Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and Erik Brynjolfsson and Shyamal Buch and Dallas Card and Rodrigo Castellon and Niladri S. Chatterji and Annie S. Chen and Kathleen Creel and Jared Quincy Davis and D...

  31. [38]

    2020 , eprint=

    Fine-Tuning Language Models from Human Preferences , author=. 2020 , eprint=

  32. [39]

    2020 , eprint=

    Language Models are Few-Shot Learners , author=. 2020 , eprint=

  33. [40]

    2022 , url =

    janus , title =. 2022 , url =

  34. [41]

    2023 , eprint=

    Simple synthetic data reduces sycophancy in large language models , author=. 2023 , eprint=

  35. [42]

    2022 , eprint=

    TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. 2022 , eprint=

  36. [43]

    2021 , eprint=

    Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

  37. [44]

    2017 , eprint=

    Attention Is All You Need , author=. 2017 , eprint=

  38. [45]

    2023 , url =

    Nina Panickssery , title =. 2023 , url =

  39. [46]

    2023 , url =

    Anthropic , title =. 2023 , url =

  40. [47]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  41. [48]

    2023 , url =

    Trenton Bricken and Adly Templeton and Joshua Batson and Brian Chen and Adam Jermyn and Tom Conerly and Nicholas L Turner and Cem Anil and Carson Denison and Amanda Askell and Robert Lasenby and Yifan Wu and Shauna Kravec and Nicholas Schiefer and Tim Maxwell and Nicholas Joseph and Alex Tamkin and Karina Nguyen and Brayden McLean and Josiah E Burke and T...

  42. [49]

    2023 , eprint=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

  43. [50]

    2023 , eprint=

    Demystifying Embedding Spaces using Large Language Models , author=. 2023 , eprint=

  44. [51]

    and Amodei, Dario and Sutskever, Ilya , title =

    Radford, Alec and Wu, Jeff and Child, Rewon and Luan, D. and Amodei, Dario and Sutskever, Ilya , title =

  45. [52]

    2022 , eprint=

    Rawte, Vipula and Chakraborty, Swagata and Pathak, Agnibh and Sarkar, Anubhav and Islam Tonmoy, S.M Towhidul and Chadha, Aman and Sheth, Amit and Das, Amitava , title=. 2022 , eprint=

  46. [53]

    2023 , eprint=

    Linear Representations of Sentiment in Large Language Models , author=. 2023 , eprint=

  47. [54]

    2022 , eprint=

    Extracting Latent Steering Vectors from Pretrained Language Models , author=. 2022 , eprint=

  48. [55]

    2023 , eprint=

    Inspecting and Editing Knowledge Representations in Language Models , author=. 2023 , eprint=

  49. [56]

    2023 , url =

    Stefan Heimersheim and Alex Turner , title =. 2023 , url =

  50. [57]

    and Varoquaux, G

    Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. , journal=. Scikit-learn: Machine Learning in

  51. [58]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K. PyTorch: An Imperative Style, High-Performance Deep Learning Library , journal =. 2019 , url =. 1912.01703 , timestamp =

  52. [59]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R. HuggingFace's Transformers: State-of-the-art Natural Language Processing , journal =. 2019 , url =. 1910.03771 , timestamp =

  53. [60]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

    Samyam Rajbhandari and Jeff Rasley and Olatunji Ruwase and Yuxiong He , title =. CoRR , volume =. 2019 , url =. 1910.02054 , timestamp =

  54. [61]

    2023 , eprint=

    In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering , author=. 2023 , eprint=

  55. [62]

    2023 , eprint=

    The Linear Representation Hypothesis and the Geometry of Large Language Models , author=. 2023 , eprint=

  56. [63]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei and Maarten Bosma and Vincent Y. Zhao and Kelvin Guu and Adams Wei Yu and Brian Lester and Nan Du and Andrew M. Dai and Quoc V. Le , title =. CoRR , volume =. 2021 , url =. 2109.01652 , timestamp =

  57. [64]

    Is GPT-4 a reliable rater? Evaluating consistency in GPT-4’s text ratings , volume=

    Hackl, Veronika and Müller, Alexandra Elena and Granitzer, Michael and Sailer, Maximilian , year=. Is GPT-4 a reliable rater? Evaluating consistency in GPT-4’s text ratings , volume=. doi:10.3389/feduc.2023.1272229 , journal=

  58. [65]

    Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022

  59. [66]

    A Systematic Survey of Text Worlds as Embodied Natural Language Environments

    Jansen, Peter. A Systematic Survey of Text Worlds as Embodied Natural Language Environments. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.1

  60. [67]

    A Minimal Computational Improviser Based on Oral Thought

    Montfort, Nick and Bartlett Fernandez, Sebastian. A Minimal Computational Improviser Based on Oral Thought. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.2

  61. [68]

    Craft an Iron Sword: Dynamically Generating Interactive Game Characters by Prompting Large Language Models Tuned on Code

    Volum, Ryan and Rao, Sudha and Xu, Michael and DesGarennes, Gabriel and Brockett, Chris and Van Durme, Benjamin and Deng, Olivia and Malhotra, Akanksha and Dolan, Bill. Craft an Iron Sword: Dynamically Generating Interactive Game Characters by Prompting Large Language Models Tuned on Code. Proceedings of the 3rd Wordplay: When Language Meets Games Worksho...

  62. [69]

    A Sequence Modelling Approach to Question Answering in Text-Based Games

    Furman, Gregory and Toledo, Edan and Shock, Jonathan and Buys, Jan. A Sequence Modelling Approach to Question Answering in Text-Based Games. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.4

  63. [70]

    Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents

    Teodorescu, Laetitia and Yuan, Xingdi and C \^o t \'e , Marc-Alexandre and Oudeyer, Pierre-Yves. Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.5

  64. [71]

    Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022

  65. [72]

    Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing

    Yuan, Shuzhou and Maronikolakis, Antonis and Sch. Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.1

  66. [73]

    Towards Automatic Generation of Messages Countering Online Hate Speech and Microaggressions

    Ashida, Mana and Komachi, Mamoru. Towards Automatic Generation of Messages Countering Online Hate Speech and Microaggressions. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.2

  67. [74]

    G rease V ision: Rewriting the Rules of the Interface

    Datta, Siddhartha and Kollnig, Konrad and Shadbolt, Nigel. G rease V ision: Rewriting the Rules of the Interface. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.3

  68. [75]

    Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation

    Ludwig, Florian and Dolos, Klara and Zesch, Torsten and Hobley, Eleanor. Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.4

  69. [76]

    `` Zo Grof ! '' : A Comprehensive Corpus for Offensive and Abusive Language in D utch

    Ruitenbeek, Ward and Zwart, Victor and Van Der Noord, Robin and Gnezdilov, Zhenja and Caselli, Tommaso. `` Zo Grof ! '' : A Comprehensive Corpus for Offensive and Abusive Language in D utch. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.5

  70. [77]

    Counter- TWIT : An I talian Corpus for Online Counterspeech in Ecological Contexts

    Goffredo, Pierpaolo and Basile, Valerio and Cepollaro, Bianca and Patti, Viviana. Counter- TWIT : An I talian Corpus for Online Counterspeech in Ecological Contexts. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.6

  71. [78]

    S tereo KG : Data-Driven Knowledge Graph Construction For Cultural Knowledge and Stereotypes

    Deshpande, Awantee and Ruiter, Dana and Mosbach, Marius and Klakow, Dietrich. S tereo KG : Data-Driven Knowledge Graph Construction For Cultural Knowledge and Stereotypes. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.7

  72. [79]

    The subtle language of exclusion: Identifying the Toxic Speech of Trans-exclusionary Radical Feminists

    Lu, Christina and Jurgens, David. The subtle language of exclusion: Identifying the Toxic Speech of Trans-exclusionary Radical Feminists. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.8

  73. [80]

    Lost in Distillation: A Case Study in Toxicity Modeling

    Chvasta, Alyssa and Lees, Alyssa and Sorensen, Jeffrey and Vasserman, Lucy and Goyal, Nitesh. Lost in Distillation: A Case Study in Toxicity Modeling. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.9

  74. [81]

    Cleansing & expanding the HURTLEX (el) with a multidimensional categorization of offensive words

    Stamou, Vivian and Alexiou, Iakovi and Klimi, Antigone and Molou, Eleftheria and Saivanidou, Alexandra and Markantonatou, Stella. Cleansing & expanding the HURTLEX (el) with a multidimensional categorization of offensive words. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.10

  75. [82]

    Free speech or Free Hate Speech? Analyzing the Proliferation of Hate Speech in Parler

    Israeli, Abraham and Tsur, Oren. Free speech or Free Hate Speech? Analyzing the Proliferation of Hate Speech in Parler. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.11

  76. [83]

    Resources for Multilingual Hate Speech Detection

    Arango Monnar, Ayme and Perez, Jorge and Poblete, Barbara and Salda \ n a, Magdalena and Proust, Valentina. Resources for Multilingual Hate Speech Detection. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.12

  77. [84]

    Enriching Abusive Language Detection with Community Context

    Saleem, Haji Mohammad and Kurrek, Jana and Ruths, Derek. Enriching Abusive Language Detection with Community Context. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.13

  78. [85]

    DeTox: A Comprehensive Dataset for G erman Offensive Language and Conversation Analysis

    Demus, Christoph and Pitz, Jonas and Sch. DeTox: A Comprehensive Dataset for G erman Offensive Language and Conversation Analysis. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.14

  79. [86]

    Multilingual H ate C heck: Functional Tests for Multilingual Hate Speech Detection Models

    R. Multilingual H ate C heck: Functional Tests for Multilingual Hate Speech Detection Models. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.15

  80. [87]

    Distributional properties of political dogwhistle representations in S wedish BERT

    Hertzberg, Niclas and Cooper, Robin and Lindgren, Elina and R. Distributional properties of political dogwhistle representations in S wedish BERT. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.16

Showing first 80 references.