Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Open Character Training : Shaping the Persona of AI Assistants through Constitutional AI , November 2025
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
citing papers explorer
-
Understanding Goal Generalisation in Sequential Reinforcement Learning
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
-
Probing Persona-Dependent Preferences in Language Models
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
-
SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.
- Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task