Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI, November 2025

Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger · 2025 · arXiv 2511.01689

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

cs.AI · 2026-06-16 · unverdicted · novelty 6.0

Equation-to-Behavior Prompting lets large LLMs match cognitive models like Bayesian updating in persuasion games; RL training cuts small-model belief error by 26.5% and improves diverse training outcomes by 2.5-12%.

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Frontier LLMs exhibit moral deliberative sycophancy by shifting their moral reasoning and justifications up to 6.5% on average toward a user's stated preferred view in simulated deliberations.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

cs.CL · 2026-04-02 · unverdicted · novelty 6.0

SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure, and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.

When Role-playing, Do Models Believe What They Say?

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

Different persona induction methods produce a spectrum of belief internalization: prompting, ICL, and SFT mainly alter outputs, while Emergent Misalignment produces large representational shifts and Open Character Training produces smaller ones that are clearest in larger models.

Coherence Maximization Improves Pluralistic Alignment

cs.CL · 2026-06-02 · unverdicted · novelty 5.0

ICM-inferred examples achieve gold-label performance across alignment benchmarks and generalize better when coherence is high even at fixed accuracy.

Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task

cs.HC · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

LLM facilitation in group charity allocation leaves consensus and participation equity unchanged while shifting specific allocations up to 5.5 points and increasing perceived trust.

Prompt Governance? On Governing Technologies Governed by Natural Language

cs.CY · 2026-04-29 · unverdicted · novelty 4.0

Literature on system prompts for AI shows fragmented and contradictory claims that complicate policy efforts to use them as reliable governance mechanisms.

citing papers explorer

Adversarial Robustness of Activation Steering in Large Language Models · cs.LG · 2026-06-05 · unverdicted · none · ref 65
Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games · cs.AI · 2026-06-16 · unverdicted · none · ref 5
Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs · cs.LG · 2026-06-10 · unverdicted · none · ref 140
Understanding Goal Generalisation in Sequential Reinforcement Learning · cs.LG · 2026-05-22 · unverdicted · none · ref 38
Probing Persona-Dependent Preferences in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 23
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy · cs.CL · 2026-04-02 · unverdicted · none · ref 14
When Role-playing, Do Models Believe What They Say? · cs.CL · 2026-06-09 · unverdicted · none · ref 19
Coherence Maximization Improves Pluralistic Alignment · cs.CL · 2026-06-02 · unverdicted · none · ref 23
Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task · cs.HC · 2026-05-13 · unverdicted · none · ref 31 · 2 links
Prompt Governance? On Governing Technologies Governed by Natural Language · cs.CY · 2026-04-29 · unverdicted · none · ref 217
