Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
Preprint, arXiv:2305.16367
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
CARD uses style-based user clustering and implicit preference contrasts to enable efficient personalized text generation via lightweight decoding adjustments on frozen LLMs.
Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
citing papers explorer
-
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
-
CARD: Cluster-level Adaptation with Reward-guided Decoding for Personalized Text Generation
CARD uses style-based user clustering and implicit preference contrasts to enable efficient personalized text generation via lightweight decoding adjustments on frozen LLMs.
-
The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious
Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
-
A Roadmap to Pluralistic Alignment
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.