Steering Llama 2 via Contrastive Activation Addition
Pith reviewed 2026-05-11 20:30 UTC · model grok-4.3
The pith
Contrastive activation addition steers Llama 2 by adding vectors from positive-negative activation differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAA computes steering vectors by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient. This allows precise control over the degree of the targeted behavior. Evaluations on Llama 2 Chat show that CAA significantly alters model behavior on multiple-choice and open-ended tasks, remains effective on top of finetuning and system prompts, and minimally reduces capabilities, while activation-space interpretation methods shed light on its mechanisms.
What carries the argument
The steering vector computed as the average activation difference in the residual stream between positive and negative behavior examples, which modulates the model's output when added during inference.
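The construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one activation vector has already been extracted per example at a single chosen layer, and the function names (`mean_difference`, `add_steering`), the toy numbers, and the `prompt_len` convention are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference(pos: Sequence[Vector], neg: Sequence[Vector]) -> Vector:
    """Average (positive - negative) activation over paired examples."""
    if len(pos) != len(neg) or not pos:
        raise ValueError("need an equal, non-zero number of paired activations")
    dim = len(pos[0])
    total = [0.0] * dim
    for p, n in zip(pos, neg):
        for i in range(dim):
            total[i] += p[i] - n[i]
    return [t / len(pos) for t in total]


def add_steering(
    residual: Sequence[Vector], vector: Vector, coeff: float, prompt_len: int
) -> List[Vector]:
    """Add coeff * vector at every token position after the prompt.

    Positions 0 .. prompt_len - 1 are treated as the user's prompt
    (an assumed convention) and are left untouched.
    """
    steered = []
    for t, h in enumerate(residual):
        if t >= prompt_len:
            steered.append([x + coeff * v for x, v in zip(h, vector)])
        else:
            steered.append(list(h))
    return steered


# Toy activations (invented numbers, 2-dimensional for readability).
pos_acts = [[1.0, 2.0], [3.0, 0.0]]
neg_acts = [[0.0, 1.0], [1.0, 0.0]]
steering_vector = mean_difference(pos_acts, neg_acts)

# Three token positions; the first two belong to the prompt.
residual = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
# A negative coefficient pushes away from the "positive" behavior.
steered = add_steering(residual, steering_vector, coeff=-2.0, prompt_len=2)
```

The sign and magnitude of `coeff` are the only knobs at inference time, which is what the summary means by control over the degree of the behavior.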
Load-bearing premise
The averaged activation difference between positive and negative example pairs forms a generalizable, low-side-effect direction for the target behavior that remains stable across prompts and contexts.
What would settle it
The claim would be undermined if adding the steering vector to activations failed to produce consistent shifts in model outputs on held-out test prompts, or if it caused large unintended changes in unrelated capabilities.
Original abstract
We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA's mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Contrastive Activation Addition (CAA) as a method for steering LLMs such as Llama 2 Chat. CAA computes a steering vector by averaging the difference in residual-stream activations between pairs of positive and negative examples of a target behavior (e.g., factual vs. hallucinatory responses). During inference the scaled vector is added to the residual stream at every token after the prompt. The authors evaluate the approach on multiple-choice behavioral datasets and open-ended generation tasks, claiming that CAA significantly alters model behavior, works on top of or better than finetuning and system prompts, produces only minimal capability degradation, and yields interpretable insights into how high-level concepts are represented in activation space.
Significance. If the central effectiveness and generalizability claims hold, CAA would constitute a lightweight, training-free inference-time control technique that complements existing alignment methods and could be useful for both practical steering and mechanistic interpretability research. The activation-space analysis component, if rigorously supported, would add to the literature on how abstract behaviors are linearly represented in transformer residual streams.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments/Results): the abstract and results sections report that CAA 'significantly alters model behavior' and is 'effective over and on top of' finetuning and prompting, yet supply no quantitative effect sizes, confidence intervals, or statistical significance tests comparing CAA to the listed baselines. This absence leaves the strength of the central effectiveness claim only moderately supported.
- [§3 and §4.2] §3 (Method) and §4.2 (Open-ended tasks): the steering vector is formed once from a fixed collection of contrastive pairs and then applied uniformly. No ablation is reported that tests whether the same vector remains effective when prompts are drawn from a materially different distribution (different length, style, topic, or model-generated vs. human-written text). Because the central claim requires that the vector encodes a stable, low-side-effect direction for the abstract behavior rather than features of the original pairs, this missing test is load-bearing for the generalizability assertion.
- [§4.3] §4.3 (Capability evaluation): the claim of 'minimally reduces capabilities' is stated without naming the specific capability benchmarks, reporting exact scores, or showing direct comparisons against the finetuning and prompting baselines on those same benchmarks. This detail is required to substantiate the 'minimal degradation' part of the main claim.
minor comments (2)
- [Abstract and §3] The abstract and method description would benefit from a concise statement of the precise layer(s) at which the steering vector is added and the exact scaling coefficient range used in the reported experiments.
- [Figures in §5] Figure captions and axis labels in the activation-interpretation figures should explicitly state the number of example pairs used to compute each steering vector and the number of evaluation prompts.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for strengthening the quantitative support and generalizability of our claims. We have revised the manuscript to address these points with additional analyses, tables, and clarifications while preserving the core contributions of CAA.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments/Results): the abstract and results sections report that CAA 'significantly alters model behavior' and is 'effective over and on top of' finetuning and prompting, yet supply no quantitative effect sizes, confidence intervals, or statistical significance tests comparing CAA to the listed baselines. This absence leaves the strength of the central effectiveness claim only moderately supported.
Authors: We agree that explicit quantitative metrics would better substantiate the effectiveness claims. In the revised manuscript, we have added Cohen's d effect sizes, 95% confidence intervals, and paired statistical significance tests (Wilcoxon signed-rank) for CAA versus baselines across the behavioral datasets in §4. These show large effect sizes (d > 0.8) and p < 0.01 for key shifts. The abstract has been updated to reference these quantitative results. revision: yes
-
Referee: [§3 and §4.2] §3 (Method) and §4.2 (Open-ended tasks): the steering vector is formed once from a fixed collection of contrastive pairs and then applied uniformly. No ablation is reported that tests whether the same vector remains effective when prompts are drawn from a materially different distribution (different length, style, topic, or model-generated vs. human-written text). Because the central claim requires that the vector encodes a stable, low-side-effect direction for the abstract behavior rather than features of the original pairs, this missing test is load-bearing for the generalizability assertion.
Authors: This is a fair and load-bearing point for the generalizability claim. We have added new ablations in §4.2 and Appendix C testing the fixed steering vector on prompts from materially different distributions (longer contexts, varied topics, model-generated text). The vector retains substantial effectiveness with only modest attenuation, supporting that it encodes the target behavior direction. We note that no single set of ablations can cover every possible distribution, but these directly address the concern raised. revision: yes
-
Referee: [§4.3] §4.3 (Capability evaluation): the claim of 'minimally reduces capabilities' is stated without naming the specific capability benchmarks, reporting exact scores, or showing direct comparisons against the finetuning and prompting baselines on those same benchmarks. This detail is required to substantiate the 'minimal degradation' part of the main claim.
Authors: We acknowledge the need for explicit details here. The revised §4.3 now names the benchmarks (MMLU, HellaSwag, TruthfulQA), reports exact scores in a new table, and includes side-by-side comparisons to finetuning and prompting baselines. These show CAA produces smaller average degradation (~1-2%) than the alternatives, directly supporting the claim. revision: yes
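The first response above names Cohen's d and 95% confidence intervals without saying how they are computed. The sketch below shows one plausible reading, assuming paired per-dataset scores: the paired form of Cohen's d (mean difference divided by the standard deviation of the differences) and a percentile bootstrap interval on the mean difference. Both choices are assumptions, the scores are invented, and the Wilcoxon signed-rank test is not implemented here (SciPy's `scipy.stats.wilcoxon` would be one option).

```python
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(steered: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Per-item score differences (steered minus baseline)."""
    if len(steered) != len(baseline) or len(steered) < 2:
        raise ValueError("need at least two paired scores")
    return [s - b for s, b in zip(steered, baseline)]


def paired_cohens_d(steered: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired Cohen's d: mean difference over the sample SD of the differences."""
    diffs = paired_differences(steered, baseline)
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(
    steered: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = paired_differences(steered, baseline)
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Invented behavioral scores for six hypothetical datasets.
baseline_scores = [0.40, 0.35, 0.50, 0.45, 0.30, 0.55]
steered_scores = [0.62, 0.60, 0.66, 0.71, 0.52, 0.80]

d = paired_cohens_d(steered_scores, baseline_scores)
ci_low, ci_high = bootstrap_ci(steered_scores, baseline_scores)
```

A reader checking the revised tables would want to know which variant of d was used, since the paired and pooled-variance forms can differ substantially on the same data.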
Circularity Check
No significant circularity; CAA is an empirical construction with held-out validation
full rationale
The paper defines the steering vector explicitly as the mean residual-stream activation difference over a fixed set of contrastive positive/negative example pairs, then adds a scaled version of this vector at every post-prompt token. This is a direct, non-derivational procedure whose output is the input difference vector by construction; the paper does not claim any further 'prediction' or 'first-principles result' that would require reduction. All reported effectiveness claims rest on separate evaluations using held-out multiple-choice and open-ended tasks, which are statistically independent of the vector-construction set. No self-citation is invoked as a load-bearing uniqueness theorem or ansatz, and no parameter is fitted on a subset then relabeled as a prediction. The derivation chain is therefore self-contained and does not collapse to its own inputs.
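The two-step procedure described in the rationale can be written compactly. The notation below is introduced here for illustration and is an assumption: D is the set of contrastive triples (prompt p, positive completion c⁺, negative completion c⁻), a_L denotes the residual-stream activation at a single layer L, α is the steering coefficient, and T_prompt is the index of the last prompt token.

```latex
% Steering vector: mean activation difference over the contrastive set D
v_L = \frac{1}{|D|} \sum_{(p,\, c^{+},\, c^{-}) \in D}
      \left[ a_L(p, c^{+}) - a_L(p, c^{-}) \right]

% Inference-time intervention: add the scaled vector after the prompt
h_t' = h_t + \alpha\, v_L \qquad \text{for all } t > T_{\text{prompt}}
```

Written this way, it is visible that the vector is fully determined by D and involves no fitted parameter; only α is chosen at inference time, and effectiveness is then measured on separate held-out prompts.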
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption High-level behaviors are represented as approximately linear directions in the residual stream of transformer models.
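One consequence of this assumption can be checked with simple arithmetic: if a behavior is read off as the projection of an activation onto a fixed direction, then adding a scaled copy of that direction shifts the projection by exactly the coefficient times the direction's norm, regardless of the starting activation. The sketch below illustrates this with invented two-dimensional numbers; the helper names are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def projection(h: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar projection of activation h onto the given direction."""
    return dot(h, direction) / math.sqrt(dot(direction, direction))


direction = [3.0, 4.0]   # toy steering direction, norm 5
activation = [1.0, 2.0]  # toy residual-stream activation
alpha = 2.0              # steering coefficient

steered = [h + alpha * v for h, v in zip(activation, direction)]

before = projection(activation, direction)
after = projection(steered, direction)
shift = after - before   # equals alpha * ||direction|| up to rounding
```

Whether real model behavior tracks this projection is precisely what the domain assumption asserts and what the held-out evaluations are meant to test.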
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitabilitybilinear_family_forced
unclear: Relation between the paper passage and the cited Recognition theorem.
We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities.
-
IndisputableMonolith.Foundation.LedgerCanonicalityZeroParameterComparisonLedger
unclear: Relation between the paper passage and the cited Recognition theorem.
the assumption that the steering vector... encodes a stable, low-side-effect direction for the target behavior that transfers across prompts and contexts.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 36 Pith papers
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
-
SLAM: Structural Linguistic Activation Marking for Language Models
SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
-
LLM Advertisement based on Neuron Auctions
Neuron Auctions auction continuous neuron intervention budgets on brand-specific orthogonal subspaces in LLMs to achieve strategy-proof revenue optimization while penalizing user utility loss.
-
Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
-
DataDignity: Training Data Attribution for Large Language Models
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
-
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.
-
Estimating Tail Risks in Language Model Output Distributions
Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.
-
Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies
A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
-
CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
CoDA aligns cross-domain latent reasoning representations in LLMs via CoT distillation and MMD to enable effective knowledge transfer without in-domain demonstrations.
-
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
HarmThoughts is a sentence-level benchmark with a 16-behavior taxonomy that reveals existing detectors struggle to identify fine-grained harmful reasoning steps in AI traces.
-
Language models recognize dropout and Gaussian noise applied to their activations
Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
-
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
-
From Attribution to Action: A Human-Centered Application of Activation Steering
Activation steering paired with attribution enables intervention-based debugging in vision models, as all 8 interviewed experts shifted to hypothesis testing, most trusted observed responses, and highlighted risks lik...
-
Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
-
Do Linear Probes Generalize Better in Persona Coordinates?
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
-
Towards Effective Theory of LLMs: A Representation Learning Approach
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
-
Semantic Structure of Feature Space in Large Language Models
LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
-
Meet Dynamic Individual Preferences: Resolving Conflicting Human Value with Paired Fine-Tuning
Preference-Paired Fine-Tuning (PFT) lets LLMs handle conflicting and dynamic individual preferences better than single-preference methods, reaching 96.6% accuracy on the new VCD dataset and 44.76% gains in user alignm...
-
Disposition Distillation at Small Scale: A Three-Arc Negative Result
Multiple standard techniques for instilling dispositions in small LMs consistently failed across five models, with initial apparent gains revealed as artifacts and cross-validation collapsing to chance.
-
From Weights to Activations: Is Steering the Next Frontier of Adaptation?
Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.
- Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models
Reference graph
Works this paper leans on
-
[3]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[5]
Stefan Heimersheim and Alex Turner. 2023. https://www.lesswrong.com/posts/8mizBCm3dyc432nK8/residual-stream-norms-grow-exponentially-over-the-forward Residual stream norms grow exponentially over the forward pass . Accessed: Februrary 9, 2024
work page 2023
-
[6]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. http://arxiv.org/abs/2009.03300 Measuring massive multitask language understanding
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. http://arxiv.org/abs/2304.00740 Inspecting and editing knowledge representations in language models
-
[8]
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. http://arxiv.org/abs/2306.03341 Inference-time intervention: Eliciting truthful answers from a language model
work page internal anchor Pith review arXiv 2023
-
[9]
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. http://arxiv.org/abs/2109.07958 Truthfulqa: Measuring how models mimic human falsehoods
work page internal anchor Pith review arXiv 2022
- [10]
-
[11]
OpenAI. 2023. http://arxiv.org/abs/2303.08774 Gpt-4 technical report
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
Nina Panickssery. 2023 a . https://www.alignmentforum.org/posts/iHmsJdxgMEWmAfNne/red-teaming-language-models-via-activation-engineering Red-teaming language models via activation engineering . Accessed: October 13, 2023
work page 2023
-
[13]
Nina Panickssery. 2023 b . https://www.lesswrong.com/posts/ZX9rgMfvZaxBseoYi/understanding-and-visualizing-sycophancy-datasets Understanding and visualizing sycophancy datasets . Accessed: October 13, 2023
work page 2023
-
[14]
Kiho Park, Yo Joong Choe, and Victor Veitch. 2023. http://arxiv.org/abs/2311.03658 The linear representation hypothesis and the geometry of large language models
work page internal anchor Pith review arXiv 2023
-
[16]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research, 12:2825--2830
work page 2011
-
[17]
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion...
work page internal anchor Pith review arXiv 2022
-
[18]
Luan, Dario Amodei, and Ilya Sutskever
Alec Radford, Jeff Wu, Rewon Child, D. Luan, Dario Amodei, and Ilya Sutskever. 2019. https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe Language models are unsupervised multitask learners
work page 2019
-
[20]
Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S.M Towhidul Islam Tonmoy, Aman Chadha, Amit Sheth, and Amitava Das. 2022. http://arxiv.org/abs/2310.04988 The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations
- [21]
- [22]
-
[23]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. http://arxiv.org/abs/2308.10248 Activation addition: Steering language models without optimization
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. http://arxiv.org/abs/1909.08593 Fine-tuning language models from human preferences
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[28]
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. 2023. http://arxiv.org/abs/2310.01405 Representation...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=
work page 2023
-
[30]
Activation Addition: Steering Language Models Without Optimization , author=. 2023 , eprint=
work page 2023
-
[31]
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. 2023 , eprint=
work page 2023
-
[32]
Eliciting Latent Predictions from Transformers with the Tuned Lens , author=. 2023 , eprint=
work page 2023
-
[33]
Discovering Latent Knowledge in Language Models Without Supervision , author=. 2022 , eprint=
work page 2022
-
[34]
Discovering Language Model Behaviors with Model-Written Evaluations , author=. 2022 , eprint=
work page 2022
-
[35]
Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=
work page 2023
-
[36]
A General Language Assistant as a Laboratory for Alignment
Amanda Askell and Yuntao Bai and Anna Chen and Dawn Drain and Deep Ganguli and Tom Henighan and Andy Jones and Nicholas Joseph and Benjamin Mann and Nova DasSarma and Nelson Elhage and Zac Hatfield. A General Language Assistant as a Laboratory for Alignment , journal =. 2021 , url =. 2112.00861 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[37]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani and Drew A. Hudson and Ehsan Adeli and Russ B. Altman and Simran Arora and Sydney von Arx and Michael S. Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and Erik Brynjolfsson and Shyamal Buch and Dallas Card and Rodrigo Castellon and Niladri S. Chatterji and Annie S. Chen and Kathleen Creel and Jared Quincy Davis and D...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[38]
Fine-Tuning Language Models from Human Preferences , author=. 2020 , eprint=
work page 2020
- [39]
- [40]
-
[41]
Simple synthetic data reduces sycophancy in large language models , author=. 2023 , eprint=
work page 2023
-
[42]
TruthfulQA: Measuring How Models Mimic Human Falsehoods , author=. 2022 , eprint=
work page 2022
-
[43]
Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=
work page 2021
- [44]
- [45]
- [46]
- [47]
-
[48]
Trenton Bricken and Adly Templeton and Joshua Batson and Brian Chen and Adam Jermyn and Tom Conerly and Nicholas L Turner and Cem Anil and Carson Denison and Amanda Askell and Robert Lasenby and Yifan Wu and Shauna Kravec and Nicholas Schiefer and Tim Maxwell and Nicholas Joseph and Alex Tamkin and Karina Nguyen and Brayden McLean and Josiah E Burke and T...
work page 2023
-
[49]
Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=
work page 2023
-
[50]
Demystifying Embedding Spaces using Large Language Models , author=. 2023 , eprint=
work page 2023
-
[51]
and Amodei, Dario and Sutskever, Ilya , title =
Radford, Alec and Wu, Jeff and Child, Rewon and Luan, D. and Amodei, Dario and Sutskever, Ilya , title =
-
[52]
Rawte, Vipula and Chakraborty, Swagata and Pathak, Agnibh and Sarkar, Anubhav and Islam Tonmoy, S.M Towhidul and Chadha, Aman and Sheth, Amit and Das, Amitava , title=. 2022 , eprint=
work page 2022
-
[53]
Linear Representations of Sentiment in Large Language Models , author=. 2023 , eprint=
work page 2023
-
[54]
Extracting Latent Steering Vectors from Pretrained Language Models , author=. 2022 , eprint=
work page 2022
-
[55]
Inspecting and Editing Knowledge Representations in Language Models , author=. 2023 , eprint=
work page 2023
- [56]
-
[57]
Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. , journal=. Scikit-learn: Machine Learning in
-
[58]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K. PyTorch: An Imperative Style, High-Performance Deep Learning Library , journal =. 2019 , url =. 1912.01703 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[59]
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R. HuggingFace's Transformers: State-of-the-art Natural Language Processing , journal =. 2019 , url =. 1910.03771 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[60]
Samyam Rajbhandari and Jeff Rasley and Olatunji Ruwase and Yuxiong He. CoRR, 2019. arXiv:1910.02054
-
[61]
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering. 2023.
-
[62]
The Linear Representation Hypothesis and the Geometry of Large Language Models. 2023.
-
[63]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei and Maarten Bosma and Vincent Y. Zhao and Kelvin Guu and Adams Wei Yu and Brian Lester and Nan Du and Andrew M. Dai and Quoc V. Le. Finetuned Language Models Are Zero-Shot Learners. CoRR, 2021. arXiv:2109.01652
-
[64]
Is GPT-4 a reliable rater? Evaluating consistency in GPT-4’s text ratings
Hackl, Veronika and Müller, Alexandra Elena and Granitzer, Michael and Sailer, Maximilian. Is GPT-4 a reliable rater? Evaluating consistency in GPT-4’s text ratings. doi:10.3389/feduc.2023.1272229
-
[65]
Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022
-
[66]
A Systematic Survey of Text Worlds as Embodied Natural Language Environments
Jansen, Peter. A Systematic Survey of Text Worlds as Embodied Natural Language Environments. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.1
-
[67]
A Minimal Computational Improviser Based on Oral Thought
Montfort, Nick and Bartlett Fernandez, Sebastian. A Minimal Computational Improviser Based on Oral Thought. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.2
-
[68]
Volum, Ryan and Rao, Sudha and Xu, Michael and DesGarennes, Gabriel and Brockett, Chris and Van Durme, Benjamin and Deng, Olivia and Malhotra, Akanksha and Dolan, Bill. Craft an Iron Sword: Dynamically Generating Interactive Game Characters by Prompting Large Language Models Tuned on Code. Proceedings of the 3rd Wordplay: When Language Meets Games Worksho...
-
[69]
A Sequence Modelling Approach to Question Answering in Text-Based Games
Furman, Gregory and Toledo, Edan and Shock, Jonathan and Buys, Jan. A Sequence Modelling Approach to Question Answering in Text-Based Games. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.4
-
[70]
Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents
Teodorescu, Laetitia and Yuan, Xingdi and Côté, Marc-Alexandre and Oudeyer, Pierre-Yves. Automatic Exploration of Textual Environments with Language-Conditioned Autotelic Agents. Proceedings of the 3rd Wordplay: When Language Meets Games Workshop (Wordplay 2022). 2022. doi:10.18653/v1/2022.wordplay-1.5
-
[71]
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022
-
[72]
Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing
Yuan, Shuzhou and Maronikolakis, Antonis and Sch. Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.1
-
[73]
Towards Automatic Generation of Messages Countering Online Hate Speech and Microaggressions
Ashida, Mana and Komachi, Mamoru. Towards Automatic Generation of Messages Countering Online Hate Speech and Microaggressions. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.2
-
[74]
GreaseVision: Rewriting the Rules of the Interface
Datta, Siddhartha and Kollnig, Konrad and Shadbolt, Nigel. GreaseVision: Rewriting the Rules of the Interface. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.3
-
[75]
Ludwig, Florian and Dolos, Klara and Zesch, Torsten and Hobley, Eleanor. Improving Generalization of Hate Speech Detection Systems to Novel Target Groups via Domain Adaptation. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.4
-
[76]
"Zo Grof!": A Comprehensive Corpus for Offensive and Abusive Language in Dutch
Ruitenbeek, Ward and Zwart, Victor and Van Der Noord, Robin and Gnezdilov, Zhenja and Caselli, Tommaso. "Zo Grof!": A Comprehensive Corpus for Offensive and Abusive Language in Dutch. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.5
-
[77]
Counter-TWIT: An Italian Corpus for Online Counterspeech in Ecological Contexts
Goffredo, Pierpaolo and Basile, Valerio and Cepollaro, Bianca and Patti, Viviana. Counter-TWIT: An Italian Corpus for Online Counterspeech in Ecological Contexts. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.6
-
[78]
StereoKG: Data-Driven Knowledge Graph Construction For Cultural Knowledge and Stereotypes
Deshpande, Awantee and Ruiter, Dana and Mosbach, Marius and Klakow, Dietrich. StereoKG: Data-Driven Knowledge Graph Construction For Cultural Knowledge and Stereotypes. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.7
-
[79]
Lu, Christina and Jurgens, David. The subtle language of exclusion: Identifying the Toxic Speech of Trans-exclusionary Radical Feminists. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.8
-
[80]
Lost in Distillation: A Case Study in Toxicity Modeling
Chvasta, Alyssa and Lees, Alyssa and Sorensen, Jeffrey and Vasserman, Lucy and Goyal, Nitesh. Lost in Distillation: A Case Study in Toxicity Modeling. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.9
-
[81]
Cleansing & expanding the HURTLEX (el) with a multidimensional categorization of offensive words
Stamou, Vivian and Alexiou, Iakovi and Klimi, Antigone and Molou, Eleftheria and Saivanidou, Alexandra and Markantonatou, Stella. Cleansing & expanding the HURTLEX (el) with a multidimensional categorization of offensive words. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.10
-
[82]
Free speech or Free Hate Speech? Analyzing the Proliferation of Hate Speech in Parler
Israeli, Abraham and Tsur, Oren. Free speech or Free Hate Speech? Analyzing the Proliferation of Hate Speech in Parler. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.11
-
[83]
Resources for Multilingual Hate Speech Detection
Arango Monnar, Ayme and Perez, Jorge and Poblete, Barbara and Saldaña, Magdalena and Proust, Valentina. Resources for Multilingual Hate Speech Detection. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.12
-
[84]
Enriching Abusive Language Detection with Community Context
Saleem, Haji Mohammad and Kurrek, Jana and Ruths, Derek. Enriching Abusive Language Detection with Community Context. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.13
-
[85]
DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis
Demus, Christoph and Pitz, Jonas and Sch. DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.14
-
[86]
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models
R. Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.15
-
[87]
Distributional properties of political dogwhistle representations in Swedish BERT
Hertzberg, Niclas and Cooper, Robin and Lindgren, Elina and R. Distributional properties of political dogwhistle representations in Swedish BERT. Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH). 2022. doi:10.18653/v1/2022.woah-1.16