Activation scaling for steering and interpreting language models

Stoehr, N · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

ASRU combines activation redirection and reward-optimized fine-tuning to unlearn cross-modal sensitive knowledge in MLLMs, reporting +24.6% better unlearning effectiveness and 5.8x higher generation quality on Qwen3-VL while preserving utility with limited retained data.
