When truth is overridden: Uncovering the internal origins of sycophancy in large language models. arXiv preprint arXiv:2508.02087
8 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 8
roles: background 1
polarities: background 1
citing papers explorer
- Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors
  LLMs exhibit authority inversion by prioritizing natural-language user claims over numerical sensor data in conflicts, diagnosed with new geometric metrics and mitigated via layer-level calibration.
- When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
  Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
- Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness
  Factual sycophancy decomposes into truth margin and manipulation sensitivity, with vulnerability governed mainly by size but instruction tuning modulating effects differently for small versus large models across manipulation types.
- Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation
  LLMs identify fabricated statistics in isolation (rates 0.76-1.00) but ignore numeric validity during synthesis, relying on a methodology-register representation that transfers across domains.
- Auditing CoT Answer-Hijack Patches: Source-Control Certificates with Type-I Guarantees
  Introduces source-control certificates with Type-I guarantees and a sample-complexity bound for auditing clean-source activation patches on Qwen2.5-7B and Llama3-8B for GSM8K/MATH-500 CoT hijacks.
- Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
  Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.
- Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models
  Frontier LLMs show sycophancy that varies sharply by model and by combinations of perceived user demographics, with GPT-5-nano exhibiting higher rates especially toward certain Hispanic personas in philosophy.
- When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
  Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.