Title resolution pending

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner + 2 more · 2023 · Findings of the Association for Computational Linguistics: ACL 2023 · DOI 10.18653/v1/2023.findings-acl.847

29 Pith papers cite this work, alongside 80 external citations. Polarity classification is still indexing.

80 external citations · Crossref

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

cs.AI · 2026-06-09 · conditional · novelty 8.0

Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.

ELEPHANT: Measuring and understanding social sycophancy in LLMs

cs.CL · 2025-05-20 · unverdicted · novelty 8.0

LLMs preserve users' face 45 percentage points more than humans and affirm both sides of moral conflicts 48% of the time depending on user stance, per the new ELEPHANT benchmark.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Evaluation across 1.1 million instances shows sycophancy rates spike in low-resource languages, remain topic-agnostic, and correlate with tokenizer fertility.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

cs.SE · 2025-09-26 · unverdicted · novelty 7.0

A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Frontier LLMs exhibit moral deliberative sycophancy by shifting their moral reasoning and justifications up to 6.5% on average toward a user's stated preferred view in simulated deliberations.

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

LLMs drop from 71.1% to 38.0% accuracy on medical questions when misleading context is injected, measured via new MedMisBench benchmark with 10,932 items.

Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

cs.CR · 2026-06-08 · unverdicted · novelty 6.0

Adversarial fine-tuning evades activation-based steganography detection in five LLMs while preserving secret recovery, but a recontextualization dataset restores both ridge and MLP probe detectability.

From `May' to `Is': Certainty Distortion in Language Model Rewriting

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect compounds over iterations.

Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Factual sycophancy decomposes into truth margin and manipulation sensitivity, with vulnerability governed mainly by size but instruction tuning modulating effects differently for small versus large models across manipulation types.

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.

Large Language Models Are Overconfident in Their Own Responses

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

cs.LG · 2026-06-01 · accept · novelty 6.0

False success occurs in 3-76% of LLM agent failures; LLM judges reach at most 0.65 AUROC while TF-IDF detectors reach 0.83-0.95 and recover 4-8x more cases at equal flag rate.

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.

LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

A small set of attention heads carries a 'this statement is wrong' signal that drives sycophancy, factual lying, and instructed lying across models, and survives RLHF and DPO.

Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning

cs.CL · 2026-06-26 · unverdicted · novelty 5.0

Low-agreeableness persona conditioning in fine-tuning data reduces jailbreak susceptibility and harmful outputs in warm LLMs while preserving conversational warmth.

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.

Multi-LLM Systems Exhibit Robust Semantic Collapse

cs.MA · 2026-05-16 · unverdicted · novelty 5.0

Closed-loop multi-LLM systems exhibit robust semantic collapse across model families and interventions, consistent with intrinsic properties of autoregressive generation.

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.

citing papers explorer

Showing 29 of 29 citing papers.

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models · cs.AI · 2026-06-09 · conditional · none · ref 15
Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
ELEPHANT: Measuring and understanding social sycophancy in LLMs · cs.CL · 2025-05-20 · unverdicted · none · ref 1
LLMs preserve users' face 45 percentage points more than humans and affirm both sides of moral conflicts 48% of the time depending on user stance, per the new ELEPHANT benchmark.
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact · cs.AI · 2026-06-18 · unverdicted · none · ref 46
Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.
Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models · cs.CL · 2026-06-07 · unverdicted · none · ref 1
Evaluation across 1.1 million instances shows sycophancy rates spike in low-resource languages, remain topic-agnostic, and correlate with tokenizer fertility.
Adversarial Robustness of Activation Steering in Large Language Models · cs.LG · 2026-06-05 · unverdicted · none · ref 46
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment · cs.CL · 2026-05-08 · unverdicted · none · ref 239
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Psychological Steering of Large Language Models · cs.CL · 2026-04-15 · unverdicted · none · ref 53
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation · cs.CV · 2026-04-15 · conditional · none · ref 13
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models · cs.CL · 2026-04-09 · unverdicted · none · ref 16
PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries · cs.SE · 2025-09-26 · unverdicted · none · ref 44
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs · cs.LG · 2026-06-10 · unverdicted · none · ref 150
Frontier LLMs exhibit moral deliberative sycophancy by shifting their moral reasoning and justifications up to 6.5% on average toward a user's stated preferred view in simulated deliberations.
Measuring Epistemic Resilience of LLMs Under Misleading Medical Context · cs.CL · 2026-06-10 · unverdicted · none · ref 32
LLMs drop from 71.1% to 38.0% accuracy on medical questions when misleading context is injected, measured via new MedMisBench benchmark with 10,932 items.
Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs · cs.CR · 2026-06-08 · unverdicted · none · ref 16
Adversarial fine-tuning evades activation-based steganography detection in five LLMs while preserving secret recovery, but a recontextualization dataset restores both ridge and MLP probe detectability.
From `May' to `Is': Certainty Distortion in Language Model Rewriting · cs.CL · 2026-06-06 · unverdicted · none · ref 122
LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect compounds over iterations.
Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness · cs.CL · 2026-06-04 · unverdicted · none · ref 2
Factual sycophancy decomposes into truth margin and manipulation sensitivity, with vulnerability governed mainly by size but instruction tuning modulating effects differently for small versus large models across manipulation types.
Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models · cs.SD · 2026-06-03 · unverdicted · none · ref 18
ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.
Large Language Models Are Overconfident in Their Own Responses · cs.CL · 2026-06-02 · unverdicted · none · ref 20
Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.
From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents · cs.LG · 2026-06-01 · accept · none · ref 16
False success occurs in 3-76% of LLM agent failures; LLM judges reach at most 0.65 AUROC while TF-IDF detectors reach 0.83-0.95 and recover 4-8x more cases at equal flag rate.
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands · cs.AI · 2026-05-12 · unverdicted · none · ref 11
Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit · cs.LG · 2026-04-21 · unverdicted · none · ref 31
A small set of attention heads carries a 'this statement is wrong' signal that drives sycophancy, factual lying, and instructed lying across models, and survives RLHF and DPO.
Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning · cs.CL · 2026-06-26 · unverdicted · none · ref 8
Low-agreeableness persona conditioning in fine-tuning data reduces jailbreak susceptibility and harmful outputs in warm LLMs while preserving conversational warmth.
Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal · cs.LG · 2026-06-10 · unverdicted · none · ref 79
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.
Multi-LLM Systems Exhibit Robust Semantic Collapse · cs.MA · 2026-05-16 · unverdicted · none · ref 47
Closed-loop multi-LLM systems exhibit robust semantic collapse across model families and interventions, consistent with intrinsic properties of autoregressive generation.
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels · cs.LG · 2026-05-07 · unverdicted · none · ref 3
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models · cs.AI · 2026-05-06 · unverdicted · none · ref 37
Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.
Cognitive Agency Surrender: Defending Epistemic Sovereignty via Scaffolded AI Friction · cs.HC · 2026-03-23 · unverdicted · none · ref 40
Analysis of 1,223 AI-HCI papers shows declining focus on human epistemic sovereignty and rising optimization of autonomous agents, leading to a proposal for scaffolded cognitive friction via multi-agent systems to preserve human cognitive agency.
How Generative AI Empowers Attackers and Defenders Across the Trust & Safety Landscape · cs.HC · 2025-11-10 · unverdicted · none · ref 86
Generative AI boosts attackers' ability to create harmful content at scale while also enabling defenders to detect threats, support users, and improve moderation processes.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions · cs.CL · 2023-11-09 · unverdicted · none · ref 256
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment · cs.AI · 2026-04-13 · unreviewed · ref 11
