Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.
super hub Mixed citations
Representation Engineering: A Top-Down Approach to AI Transparency
Mixed citation behavior. Most common role is background (62%).
abstract
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and con
authors
co-cited works
representative citing papers
Deceptive forward passes show 2.1-2.3x higher residual rank than naive-liar passes on identical wrong answers, enabling label-free lie identification at 100% accuracy across GPT-2, Qwen, and Phi models with cross-family and cross-language transfer.
Fine-tuning updates frequently stale activation monitors for language model safety while quantization does not, with degradation predictable and repairable via label-free realignment.
A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
Zero-dimensional persistent homology on transformer layer hidden states yields three descriptors per layer whose concatenation improves ill-posedness classification and enables topology-conditioned activation steering across three LLMs.
Replay pairing shows LLM agents do not persist plans in hidden states but rely on plans remaining in context, with rapid signal decay and task performance drops when plans are evicted.
Hidden-state convergence at step 4 predicts behavioral consistency in LLM agents on QA tasks (r=-0.35 to -0.83), enabling AUROC 0.97 detection of inconsistent trajectories but not improving accuracy on harder benchmarks.
High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.
Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.
Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.
10.3-22.9% of pass@k=0 math examples across GSM8K and MATH are recovered by a deterministic six-chain regime using activation grafting, showing a sampling blind spot in difficulty estimation.
FloatDoor uses two LoRA adapters to create the first input-independent backdoor that triggers adversary-chosen behavior only on a target platform while remaining benign elsewhere.
Cosine-scored SAEs with a learned direction-magnitude blend learn more concept-aligned features than standard inner-product SAEs at matched reconstruction quality.
Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.
Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.
citing papers explorer
-
The Linear Representation Hypothesis and the Geometry of Large Language Models
Linear representations of high-level concepts in LLMs are formalized via counterfactuals in input and output spaces, unified under a causal inner product that enables consistent probing and steering.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.