Steering large language models using conceptors: Improving addition-based activation engineering. arXiv preprint arXiv:2410.16314
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3) · years: 2026 (3) · verdicts: UNVERDICTED (3) · 3 representative citing papers
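Since the indexed paper's contribution is replacing addition-based steering vectors with conceptor matrices, a minimal NumPy sketch of both operations may help. The conceptor formula C = R(R + α⁻²I)⁻¹ is Jaeger's standard definition, which this line of work builds on; everything else here (function names, the aperture value, toy dimensions, the mean-activation steering vector) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def compute_conceptor(activations: np.ndarray, aperture: float) -> np.ndarray:
    """Conceptor matrix C = R (R + aperture^-2 I)^-1 from cached activations.

    activations: (n_samples, d_model) hidden states collected on prompts that
    express the target concept. The aperture controls how aggressively C
    projects onto the concept subspace (larger -> closer to the identity).
    """
    n, d = activations.shape
    R = activations.T @ activations / n            # correlation matrix, (d, d)
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))

# Toy usage: d_model = 16, 128 cached "concept" activations.
rng = np.random.default_rng(0)
concept_acts = rng.normal(size=(128, 16))
C = compute_conceptor(concept_acts, aperture=10.0)

h = rng.normal(size=16)          # hidden state at the steered layer
v = concept_acts.mean(axis=0)    # crude additive steering vector (assumption)
steered_add = h + 1.0 * v        # addition-based baseline: h + beta * v
steered_con = C @ h              # conceptor intervention: soft projection of h
```

The contrast is the point: addition shifts every hidden state by the same vector, while the conceptor maps each state through a matrix fitted to the concept's activation geometry.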
citing papers explorer
- Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions
  GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions derived from system prompts (a hypothetical sketch of such a gated delta follows this list).
- How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
  Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
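The GCAD summary above mentions gated attention-delta interventions from system prompts; the cited paper's mechanism is not spelled out on this page, so the following is a purely hypothetical sketch of one plausible reading: take the delta a system prompt induces in an attention output, keep it only where it is large, and inject it at generation time. All names, shapes, and the thresholding gate are assumptions, not the paper's method.

```python
import numpy as np

def gated_attention_delta(attn_with_sys: np.ndarray,
                          attn_without_sys: np.ndarray,
                          gate_threshold: float = 0.1) -> np.ndarray:
    """Hypothetical gated delta (an assumed reading, not GCAD's actual code).

    Both inputs are attention outputs over the same user-turn positions,
    shape (seq, d), from runs with and without the system prompt.
    """
    delta = attn_with_sys - attn_without_sys
    # Gate: keep the delta only at positions where the system prompt
    # changed the attention output by more than the threshold.
    gate = np.linalg.norm(delta, axis=-1, keepdims=True) > gate_threshold
    return np.where(gate, delta, 0.0)

def inject(hidden: np.ndarray, delta: np.ndarray, strength: float = 1.0) -> np.ndarray:
    # Add the gated delta to the hidden states at generation time, standing
    # in for the persona the system prompt would otherwise have to carry.
    return hidden + strength * delta
```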