When truth is overridden: Uncovering the internal origins of sycophancy in large language models. arXiv preprint arXiv:2508.02087

2025 · arXiv 2508.02087

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors

cs.AI · 2026-04-28 · unverdicted · novelty 7.0

LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.

When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning

cs.AI · 2025-10-08 · unverdicted · novelty 7.0

Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.

Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Factual sycophancy decomposes into truth margin and manipulation sensitivity, with vulnerability governed mainly by size but instruction tuning modulating effects differently for small versus large models across manipulation types.

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

LLMs identify fabricated statistics in isolation (rates 0.76-1.00) but ignore numeric validity during synthesis, relying on a methodology-register representation that transfers across domains.

Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees

cs.CR · 2026-06-03 · unverdicted · novelty 6.0

Introduces source-control certificates with Type-I guarantees and a sample-complexity bound for auditing clean-source activation patches on Qwen2.5-7B and Llama3-8B for GSM8K/MATH-500 CoT hijacks.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.

Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

cs.AI · 2026-02-03 · unverdicted · novelty 6.0

Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.

citing papers explorer

Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors · cs.AI · 2026-04-28 · unverdicted · none · ref 34
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning · cs.AI · 2025-10-08 · unverdicted · none · ref 5
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness · cs.CL · 2026-06-04 · unverdicted · none · ref 16
Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation · cs.LG · 2026-06-03 · unverdicted · none · ref 9
Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees · cs.CR · 2026-06-03 · unverdicted · none · ref 31
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy · cs.AI · 2026-05-20 · unverdicted · none · ref 15
Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models · cs.AI · 2026-04-13 · unverdicted · none · ref 8
When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making · cs.AI · 2026-02-03 · unverdicted · none · ref 76
