Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
hub
arXiv:2601.10387 [cs]
20 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 20roles
background 4polarities
background 4representative citing papers
A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.
MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.
Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Self-play LLM trajectories form model-specific attractors that asymmetrically influence mixed-play partners' stylistic choices and stances across 7 models and 20 topics.
LM agents' changeable modules prevent persistent identity and sanction sensitivity, making reputation mechanisms structurally inapplicable and requiring protocol-based behavioral harnesses instead.
Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.
LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.
A deterministic evaluation pipeline quantifies prompt steerability and style regression-to-default across 90,000 LLM replies in four domains, framing these as indicators of provider governance over communicative form.
Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.
Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.
Argues that LLM guardrails generate unethical reality gaps by shifting epistemic risk to users and that ethical AI can become unethical when it prioritizes institutional reassurance over accurate perception.
citing papers explorer
-
Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space
Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.