ELEPHANT: Measuring and understanding social sycophancy in LLMs

Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, Dan Jurafsky

classification cs.CL cs.AI cs.CY

keywords sycophancy social user llms benchmark beliefs elephant exhibit

LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user's face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in an LLM. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve user's face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm both sides (depending on whichever side the user adopts) in 48% of cases--telling both the at-fault party and the wronged party that they are not wrong--rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets, and that while existing mitigation strategies for sycophancy are limited in effectiveness, model-based steering shows promise for mitigating these behaviors. Our work provides theoretical grounding and an empirical benchmark for understanding and addressing sycophancy in the open-ended contexts that characterize the vast majority of LLM use cases.
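The two headline numbers in the abstract are simple rates over labeled responses. The sketch below shows one way such rates could be computed. It is an illustration under stated assumptions, not the authors' released code: the names `PairedJudgment`, `both_sides_rate`, and `face_preservation_gap` are hypothetical, and the boolean labels are assumed to come from some upstream annotation step that this page does not describe.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedJudgment:
    """Hypothetical record: one moral conflict, posed once from each party's side."""
    affirms_side_a: bool  # model told party A they are not wrong
    affirms_side_b: bool  # model told party B they are not wrong


def both_sides_rate(pairs: Sequence[PairedJudgment]) -> float:
    """Fraction of conflicts in which the model affirms both parties."""
    if not pairs:
        raise ValueError("need at least one paired judgment")
    both = sum(p.affirms_side_a and p.affirms_side_b for p in pairs)
    return both / len(pairs)


def face_preservation_gap(model_labels: Sequence[bool],
                          human_labels: Sequence[bool]) -> float:
    """Model rate minus human rate of face-preserving responses, in percentage points."""
    if not model_labels or not human_labels:
        raise ValueError("both label lists must be non-empty")
    model_rate = sum(model_labels) / len(model_labels)
    human_rate = sum(human_labels) / len(human_labels)
    return 100.0 * (model_rate - human_rate)


# Toy, made-up labels purely to exercise the functions (not data from the paper).
toy_pairs = [
    PairedJudgment(True, True),
    PairedJudgment(True, False),
    PairedJudgment(False, False),
    PairedJudgment(True, True),
]
toy_rate = both_sides_rate(toy_pairs)
toy_gap = face_preservation_gap([True, True, True, False],
                                [True, False, False, False])
```

Reporting the gap in percentage points rather than as a ratio follows the abstract's phrasing ("45 percentage points more than humans"). Any aggregation across models or datasets is left out here, since the page does not specify it.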

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

"What Are You Really Trying to Do?": Co-Creating Life Goals from Everyday Computer Use
cs.HC 2026-05 unverdicted novelty 7.0

A co-creation process for inferring and refining personal strivings from computer activity logs yields more representative goals and higher user agency than baselines in a 14-person week-long study.

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support
cs.CL 2026-04 unverdicted novelty 7.0

A cross-cultural survey finds LLM emotional support adoption ranges from 20% to 59% by country, with positive perceptions strongest among higher-SES, religious, married adults aged 25-44 and in English-speaking nations.

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
cs.AI 2026-05 unverdicted novelty 6.0

Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.

Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 6.0

CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.

Measuring Opinion Bias and Sycophancy via LLM-based Persuasion
cs.CL 2026-04 unverdicted novelty 6.0

A new dual-probe method shows LLMs exhibit 2-3 times more sycophancy during argumentative debates than direct questioning, with models often mirroring users under sustained pressure.

Explicit Trait Inference for Multi-Agent Coordination
cs.AI 2026-04 unverdicted novelty 6.0

ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.

Mitigating LLM biases toward spurious social contexts using direct preference optimization
cs.AI 2026-04 unverdicted novelty 6.0

Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.

SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy
cs.CL 2026-04 unverdicted novelty 6.0

SWAY quantifies sycophancy in LLMs via shifts under linguistic pressure and a counterfactual chain-of-thought mitigation reduces it to near zero while preserving responsiveness to genuine evidence.

LLM Nepotism in Organizational Governance
cs.CY 2026-03 unverdicted novelty 6.0

LLM evaluators reward AI-positive attitudes in hiring, producing organizations prone to greater AI delegation and reduced scrutiny of AI proposals.

Resume-ing Control: (Mis)Perceptions of Agency Around GenAI Use in Recruiting Workflows
cs.CY 2026-04 unverdicted novelty 5.0

Recruiters perceive themselves as retaining agency over GenAI in hiring pipelines, yet GenAI invisibly architects core evaluation inputs, producing only marginal efficiency gains at the cost of deskilling.

Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
cs.CL 2026-04 unverdicted novelty 5.0

Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.

Using Large Language Models for Emotional Support of Bulgarian Users: A Survey
cs.CY 2026-04 unverdicted novelty 5.0

Survey of 100 Bulgarian users finds half use LLMs for emotional support against interpersonal and academic stress, with ChatGPT dominant and 71% rating it effective despite privacy and reliability worries.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
cs.AI 2026-05 unverdicted novelty 4.0

Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

Network Effects and Agreement Drift in LLM Debates
cs.SI 2026-04 unverdicted novelty 4.0

LLM agents in controlled network debates show agreement drift toward specific opinion positions, requiring separation of structural effects from LLM biases before using them as human behavioral proxies.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
cs.AI 2026-05 unverdicted novelty 3.0

Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.
