arXiv preprint arXiv:2410.16314 , year=

Steering large language models using conceptors: Improving addition-based activation engineering , author= · 2025 · arXiv 2410.16314

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Decomposing how prompting steers behavior

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.

$\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors

stat.ML · 2026-05-15 · unverdicted · novelty 7.0

α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

GCAD reduces coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1 in persona-steering tasks by using gated attention-delta interventions from system prompts.

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.

Temporal Preference Concepts and their Functions in a Large Language Model

cs.LG · 2026-05-11 · unverdicted · novelty 5.0

Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for steering-vector control.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

citing papers explorer

Showing 1 of 1 citing paper after filters.

$\alpha$-TCAV: A Unified Framework for Testing with Concept Activation Vectors stat.ML · 2026-05-15 · unverdicted · none · ref 131
α-TCAV replaces TCAV's hard indicator with a tunable smooth function to create a unified probabilistic framework with lower variance and guidance for parameter choice or Bayes-optimal scoring.

arXiv preprint arXiv:2410.16314 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer