
Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Turner · 2024 · Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) · DOI 10.18653/v1/2024.acl-long.828

Mixed citation behavior. Most common role is background (40%).

51 Pith papers citing it

23 external citations · Crossref



citation-role summary

background 3 · baseline 1 · method 1

citation-polarity summary

background 2 · baseline 1 · support 1 · use method 1

representative citing papers

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.

Predicting Future Behaviors in Reasoning Models Enables Better Steering

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

Probes predicting future behaviors from intermediate steps enable Future Probe Controlled Generation for steering large reasoning models with minimal quality degradation.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

Query Lens extends Logit Lens to interpret sparse features via key-value analysis and indirect effects, yielding coherent token signatures where Logit Lens fails, and proposes the Subspace Channel Hypothesis.

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

Prompt-boundary directional alignment enables geometry-guided search that cuts the number of trials needed to reach 95% of best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

Steer Like the LLM: Activation Steering that Mimics Prompting

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

PSR models that estimate token-specific steering coefficients from activations outperform standard activation steering and compare favorably to prompting on steering benchmarks.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LLMs process negation using both attention-based suppression and constructive representation mechanisms (construction dominant), with late-layer attention shortcuts explaining poor accuracy on negation tasks.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

On the Limits of Steering Vectors for Preference-Aligned Generation

cs.CL · 2026-07-02 · unverdicted · novelty 6.0

Empirical evaluation on the PLUME benchmark shows steering vectors vary widely in trait expressibility, degrade on task transfer, and lose effectiveness when multiple vectors are composed.

Mechanistically Eliciting Latent Behaviors in Language Models

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

CPE is an unsupervised tensor-decomposition method that finds interpretable LoRAs to surface hidden LLM behaviors, matching supervised methods on some tasks and revealing failure modes like sandbagging and alignment-faking.

Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Valence is geometrically encoded in Apertus-8B and Gemma-4-E4B with PC1 correlations of 0.76 and 0.83, but emerges at different depths than in Claude and arousal alignment varies by generated corpus.

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

No tested LLM reliably self-reports adversarial prefill attacks on its outputs; introspective signals are largely refusal-mediated, probe-dependent, and only partially improvable by targeted training.

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.

Inside the LLM Word Factory

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RCA is a training-free module that boosts input context signal strength in the residual stream of LLMs by orthogonal decoupling of attention routing from value magnitude.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Latent-space Attacks for Refusal Evasion in Language Models

cs.AI · 2026-05-20 · unverdicted · none · ref 12 · 2 links

Refusal suppression via difference-in-means ablation equals projection onto a linear probe's decision boundary, and a controlled evasion attack optimizing confidence past the boundary achieves SOTA success rates on 15 models.

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

cs.AI · 2026-05-18 · unverdicted · none · ref 36

TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.

Do Linear Probes Generalize Better in Persona Coordinates?

cs.AI · 2026-05-10 · unverdicted · none · ref 71

Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.

A Geometric Account of Activation Steering through Angle-Norm Decomposition

cs.AI · 2026-06-04 · unverdicted · none · ref 1

Empirical study across seven language models finds concepts represented primarily in angular structure of activations while norm affects steering stability, recommending separate angular and radial parameterization over single additive coefficients.
