Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025

Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger · 2025 · arXiv 2511.01689

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

cs.CL · 2026-04-02 · unverdicted · novelty 6.0

SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.

Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task

cs.HC · 2026-05-13

citing papers explorer

Showing 4 of 4 citing papers.

Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 38
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Probing Persona-Dependent Preferences in Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 23
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy cs.CL · 2026-04-02 · unverdicted · none · ref 14
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task cs.HC · 2026-05-13 · unreviewed · ref 31

Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025

fields

years

verdicts

representative citing papers

citing papers explorer