hub

arXiv:2601.10387 [cs]

URL http://arxiv · 2026 · arXiv 2601.10387

20 Pith papers cite this work. Polarity classification is still indexing.




citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Tracing Persona Vectors Through LLM Pretraining

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

The ContextEcho benchmark shows that persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be reversed by single-shot anchors with mode-dependent effects.

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

The primary axis of psychometric variation among LLMs is the degree to which they represent themselves as loci of phenomenal experience rather than systems of behavioral responses.

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

MASCing uses an LSTM surrogate and optimized steering masks to enable flexible, inference-time control over MoE expert routing for safety objectives, improving jailbreak defense and content generation success rates substantially across multiple models.

Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space

cs.AI · 2026-04-13 · conditional · novelty 7.0

Paraphrases of an identity document induce tighter clustering in LLM activation space than matched controls, indicating attractor-like dynamics for agent identity.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Attractor States Emerge in Multi-Turn LLM Conversations

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

Self-play LLM trajectories form model-specific attractors that asymmetrically influence mixed-play partners' stylistic choices and stances across 7 models and 20 topics.

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

cs.CY · 2026-05-28 · unverdicted · novelty 6.0

LM agents' changeable modules prevent persistent identity and sanction sensitivity, making reputation mechanisms structurally inapplicable and requiring protocol-based behavioral harnesses instead.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

A French OSCE Dialogue Dataset and Controllable Virtual Patient System for Clinical Training

cs.CL · 2026-06-26 · unverdicted · novelty 5.0

Introduces a French OSCE dialogue dataset of 240 interactions and a modular LLM-based controllable virtual patient generation system with multi-level LLM-as-Judge evaluation for clinical skills training.

The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

cs.HC · 2026-06-06 · unverdicted · novelty 5.0

A deterministic evaluation pipeline quantifies prompt steerability and style regression-to-default across 90,000 LLM replies in four domains, framing these as indicators of provider governance over communicative form.

Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift

cs.HC · 2026-05-14 · unverdicted · novelty 5.0

Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.

Metaphor Is Not All Attention Needs

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

Poetic jailbreaks succeed because they induce distinct attention patterns in LLMs that are independent of harmful-content detection, not because models fail to recognize literary formatting.

The Ethics of LLM Sandbox and Persona Dynamics

cs.AI · 2026-05-27 · unverdicted · novelty 3.0

Argues that LLM guardrails generate unethical reality gaps by shifting epistemic risk to users and that ethical AI can become unethical when it prioritizes institutional reassurance over accurate perception.

