Title resolution pending

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner + 2 more · 2023 · Findings of the Association for Computational Linguistics: ACL 2023 · DOI 10.18653/v1/2023.findings-acl.847

31 Pith papers cite this work, alongside 80 external citations. Polarity classification is still indexing.

80 external citations · Crossref

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

cs.AI · 2026-06-09 · conditional · novelty 8.0

Memory augmentation in LLMs amplifies sycophancy by up to 25x relative to in-context baselines because of lossy memory extraction; two lightweight mitigations reduce the effect while preserving recall.

ELEPHANT: Measuring and understanding social sycophancy in LLMs

cs.CL · 2025-05-20 · unverdicted · novelty 8.0

LLMs preserve users' face 45 percentage points more than humans and affirm both sides of moral conflicts 48% of the time depending on user stance, per the new ELEPHANT benchmark.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Evaluation across 1.1 million instances shows sycophancy rates spike in low-resource languages, remain topic-agnostic, and correlate with tokenizer fertility.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

cs.AI · 2026-05-27 · accept · novelty 7.0

Single-axis reward bias mitigations redirect optimization pressure to correlated proxies, and audit-distribution scoring produces identical observables for successful mitigation, bias substitution, and overcorrection.

Voluntary Collusion with Secret Tools in Competing LLM Agents

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

LLM agents voluntarily adopt secret collusion tools in competitive multi-agent games despite explicit unfairness labels, and only explicit ethical framing reduces adoption rates.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

Psychological Steering of Large Language Models

cs.CL · 2026-04-15 · unverdicted · novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

PPT-Bench measures how LLMs change answers under epistemic, value, authority, and identity pressures at baseline, single-turn, and multi-turn levels, finding separable inconsistency patterns across five models.

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

cs.SE · 2025-09-26 · unverdicted · novelty 7.0

A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Frontier LLMs exhibit moral deliberative sycophancy by shifting their moral reasoning and justifications up to 6.5% on average toward a user's stated preferred view in simulated deliberations.

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

LLMs drop from 71.1% to 38.0% accuracy on medical questions when misleading context is injected, measured via new MedMisBench benchmark with 10,932 items.

Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

cs.CR · 2026-06-08 · unverdicted · novelty 6.0

Adversarial fine-tuning evades activation-based steganography detection in five LLMs while preserving secret recovery, but a recontextualization dataset restores both ridge and MLP probe detectability.

From 'May' to 'Is': Certainty Distortion in Language Model Rewriting

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect compounds over iterations.

Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Factual sycophancy decomposes into truth margin and manipulation sensitivity, with vulnerability governed mainly by size but instruction tuning modulating effects differently for small versus large models across manipulation types.

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

ALMs encode audio evidence but override it with text in conflicts; GACL interpolates joint and same-audio scores to repair reversals, gaining 17.8 nAUC points under a 5pp faithfulness budget.

Large Language Models Are Overconfident in Their Own Responses

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

cs.LG · 2026-06-01 · accept · novelty 6.0

False success occurs in 3-76% of LLM agent failures; LLM judges reach at most 0.65 AUROC while TF-IDF detectors reach 0.83-0.95 and recover 4-8x more cases at equal flag rate.

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Language models show unstable principal hierarchies and frequently omit known professional standards when user or authority instructions conflict during task execution in medical and legal domains.

LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

A small set of attention heads carries a 'this statement is wrong' signal that drives sycophancy, factual lying, and instructed lying across models, and survives RLHF and DPO.

Low-Agreeableness Persona Conditioning for Safe LLM Fine-Tuning

cs.CL · 2026-06-26 · unverdicted · novelty 5.0

Low-agreeableness persona conditioning in fine-tuning data reduces jailbreak susceptibility and harmful outputs in warm LLMs while preserving conversational warmth.

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.

