Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
hub Mixed citations
Refusal in Language Models Is Mediated by a Single Direction
Mixed citation behavior. Most common role is background (67%).
abstract
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.
LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
MENTIS applies layerwise covariance torsion (T1), spectral torsion (T2), and ERA localization to paired IT/PA 7-8B models, finding selective larger shifts for normative concepts, negative correlation with entropy, and mid-to-late layer peaks.
No tested model showed robust format-independent refusal on biosecurity hazards; a new divergence score between behavioral labels and SAE activations separated responses in one preliminary case.
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Introduces KIDBench benchmark for child-facing LLM safety, showing implicit and explicit child context cues raise safety scores 9-77% while multi-turn interactions degrade quality 6-24%.
Introduces CAZ framework using Separation, Coherence, and Velocity metrics to identify depth regions of concept allocation, with empirical tests across 34 models showing multimodal separation curves and causally active gentle CAZes.
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.
Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
LLMs show three distinct non-sycophantic responses to science skepticism, with robustness in some cases being accidental because the model does not represent the skepticism signal, as determined by linear probes on three models in three domains.
STEER is a gradient-guided attack that iteratively translates refusal-triggering words into low-resource languages to jailbreak LLMs, reaching 93-96.7% success on open models and 35.5% transfer to GPT-4o-mini.
Combining a reference-anchored activation refusal-gap with weight-recovery energy allows detection of abliterated checkpoints at AUROC 0.95 on a 273-checkpoint registry, with a calibrated threshold achieving 0.89 balanced accuracy on held-out families.
Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.
citing papers explorer
-
Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism
LLMs show three distinct non-sycophantic responses to science skepticism, with robustness in some cases being accidental because the model does not represent the skepticism signal, as determined by linear probes on three models in three domains.