Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
Title resolution pending
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
Pref-CTRL trains a multi-objective value function on preferences to guide representation editing for LLM alignment, outperforming RE-Control on benchmarks with better out-of-domain generalization.
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.
RepIt creates semantic backdoors in frontier language models by steering refusal vectors for specific concepts, allowing targeted unsafe responses while preserving safe scores on standard benchmarks.
citing papers explorer
-
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.
-
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
-
Activation Steering with a Feedback Controller
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
-
Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.
-
Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
-
Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
Pref-CTRL trains a multi-objective value function on preferences to guide representation editing for LLM alignment, outperforming RE-Control on benchmarks with better out-of-domain generalization.
-
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
FineSteer decomposes inference-time steering into Subspace-guided Conditional Steering and Mixture-of-Steering-Experts to deliver stronger control over LLM behaviors with less utility loss than prior methods.
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
-
Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models
PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.
-
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
RepIt creates semantic backdoors in frontier language models by steering refusal vectors for specific concepts, allowing targeted unsafe responses while preserving safe scores on standard benchmarks.
- Understanding LoRA as Knowledge Memory: An Empirical Analysis