CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
TALAN inserts a trainable latent memory path that remixes sequence information into small orthogonal perturbations, delivering 1.41-1.85 point average gains over matched LoRA and DoRA on four Qwen backbones and STEM/code benchmarks while adding under 1% parameters.
citing papers explorer
- Adversarial Robustness of Activation Steering in Large Language Models
  First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
- TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models
  TALAN inserts a trainable latent memory path that remixes sequence information into small orthogonal perturbations, delivering 1.41-1.85 point average gains over matched LoRA and DoRA on four Qwen backbones and STEM/code benchmarks while adding under 1% parameters.