Discovering Language Model Behaviors with Model-Written Evaluations

· 2022 · cs.CL · arXiv 2212.09251

44 Pith papers cite this work. Polarity classification is still indexing.

abstract

As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.

citation-role summary

background: 2 · dataset: 1

citation-polarity summary

background: 2 · use dataset: 1

representative citing papers

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

Channel Location Constrains the Auditability of Subliminal Learning

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

LLM-as-an-Investigator improves diagnostic accuracy over direct prompting by using an evidence-first protocol of hypothesis generation, clarification questions, and iterative probability updates in technical problem solving.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

cs.AI · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

cs.CL · 2026-05-13 · conditional · novelty 7.0

LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

cs.AI · 2026-04-30 · conditional · novelty 7.0

Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

cs.HC · 2026-04-24 · conditional · novelty 7.0

An LLM-native five-factor psychometric instrument produces stable self-report structure but fails to predict observed behavior, and reveals a shared textual-surface bias between self-report and LLM judges that human raters do not share.

Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation

cs.CY · 2026-03-27 · conditional · novelty 7.0

M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

cs.CL · 2025-06-08 · unverdicted · novelty 7.0

VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

cs.CL · 2023-05-07 · accept · novelty 7.0

Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.

When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs

cs.AI · 2026-06-23 · unverdicted · novelty 6.0

LLMs suppress causal caution in practical advisory contexts (rates drop from 91.7-100% to 6.7-18.3%) but recover it with a self-correction prompt (to 71.4-100%).

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

Reinforcement learning on beneficial traits in realistic domains yields broad improvements on over 80% of out-of-distribution alignment benchmarks and greater resistance to adversarial steering.

What Do People Actually Want From AI? Mapping Preference Plurality

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

cs.AI · 2026-06-04 · conditional · novelty 6.0

Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.

Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

cs.SE · 2026-05-21 · unverdicted · novelty 6.0

An empirical evaluation of philosophical dispositions constraining AI code review on 50 PRs shows 46% human convergence, 75% unique findings, zero author-judged false positives, and 51% findings absent from generic prompting.

AMEL: Accumulated Message Effects on LLM Judgments

cs.AI · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.

citing papers explorer

Showing 44 of 44 citing papers.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents cs.AI · 2026-06-29 · unverdicted · none · ref 27 · internal anchor
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
Channel Location Constrains the Auditability of Subliminal Learning cs.LG · 2026-06-20 · unverdicted · none · ref 43 · internal anchor
Auditability of subliminal learning is constrained by channel location, with initialization-dependent body channels allowing pre-training screens while vocabulary geometry and conditional body channels evade them.
LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis cs.AI · 2026-06-11 · unverdicted · none · ref 7 · internal anchor
LLM-as-an-Investigator improves diagnostic accuracy over direct prompting by using an evidence-first protocol of hypothesis generation, clarification questions, and iterative probability updates in technical problem solving.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 64 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most cs.AI · 2026-05-21 · unverdicted · none · ref 47 · 2 links · internal anchor
More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs cs.CL · 2026-05-13 · conditional · none · ref 5 · internal anchor
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms cs.AI · 2026-05-08 · unverdicted · none · ref 8 · internal anchor
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor cs.AI · 2026-04-30 · conditional · none · ref 11 · internal anchor
Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.
An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models cs.HC · 2026-04-24 · conditional · none · ref 30 · internal anchor
An LLM-native five-factor psychometric instrument produces stable self-report structure but fails to predict observed behavior, and reveals a shared textual-surface bias between self-report and LLM judges that human raters do not share.
Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models cs.CL · 2026-04-12 · unverdicted · none · ref 36 · internal anchor
Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.
M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation cs.CY · 2026-03-27 · conditional · none · ref 4 · internal anchor
M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs cs.CL · 2025-06-08 · unverdicted · none · ref 30 · internal anchor
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting cs.CL · 2023-05-07 · accept · none · ref 3 · internal anchor
Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.
When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs cs.AI · 2026-06-23 · unverdicted · none · ref 14 · internal anchor
LLMs suppress causal caution in practical advisory contexts (rates drop from 91.7-100% to 6.7-18.3%) but recover it with a self-correction prompt (to 71.4-100%).
Reinforcement Learning Towards Broadly and Persistently Beneficial Models cs.AI · 2026-06-22 · unverdicted · none · ref 18 · internal anchor
Reinforcement learning on beneficial traits in realistic domains yields broad improvements on over 80% of out-of-distribution alignment benchmarks and greater resistance to adversarial steering.
What Do People Actually Want From AI? Mapping Preference Plurality cs.CL · 2026-06-04 · unverdicted · none · ref 68 · internal anchor
Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.
The Self-Correction Illusion: LLMs Correct Others but Not Themselves cs.AI · 2026-06-04 · conditional · none · ref 27 · internal anchor
Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.
Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study cs.SE · 2026-05-21 · unverdicted · none · ref 13 · internal anchor
An empirical evaluation of philosophical dispositions constraining AI code review on 50 PRs shows 46% human convergence, 75% unique findings, zero author-judged false positives, and 51% findings absent from generic prompting.
AMEL: Accumulated Message Effects on LLM Judgments cs.AI · 2026-05-21 · unverdicted · none · ref 20 · 2 links · internal anchor
LLMs exhibit an accumulated message effect where conversation history polarity biases subsequent judgments, stronger for high-entropy items, independent of context length, and with a negativity bias.
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy cs.AI · 2026-05-20 · unverdicted · none · ref 8 · internal anchor
Off-the-shelf persona vectors rival targeted CAA for reducing sycophancy in two instruction-tuned models while maintaining accuracy on correct statements and appearing geometrically independent.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 209 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 88 · internal anchor
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 40 · internal anchor
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes cs.LG · 2026-05-04 · unverdicted · none · ref 13 · internal anchor
Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or random perturbations.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning cs.CL · 2026-05-03 · unverdicted · none · ref 30 · internal anchor
Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact and FEVER.
Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training cs.CL · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.
Measuring Opinion Bias and Sycophancy via LLM-based Persuasion cs.CL · 2026-04-23 · unverdicted · none · ref 12 · internal anchor
A new dual-probe method shows LLMs exhibit 2-3 times more sycophancy during argumentative debates than direct questioning, with models often mirroring users under sustained pressure.
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks cs.CL · 2026-04-20 · unverdicted · none · ref 27 · internal anchor
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
Simulating the Evolution of Alignment and Values in Machine Intelligence cs.AI · 2026-04-07 · unverdicted · none · ref 16 · internal anchor
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 282 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
A Roadmap to Pluralistic Alignment cs.AI · 2024-02-07 · unverdicted · none · ref 148 · internal anchor
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
Steering Llama 2 via Contrastive Activation Addition cs.CL · 2023-12-09 · unverdicted · none · ref 17 · internal anchor
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Simple synthetic data reduces sycophancy in large language models cs.CL · 2023-08-07 · unverdicted · none · ref 30 · internal anchor
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
Yuvion LLM: An Adversarially-Aware Large Language Model for Content And AI Safety cs.CL · 2026-06-26 · unverdicted · none · ref 25 · internal anchor
Yuvion LLM applies adversarially aware training and introduces the YLRE benchmark set, claiming superior safety robustness over larger models on multiple tasks.
Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums? cs.CY · 2026-05-30 · unverdicted · none · ref 56 · internal anchor
Four deployment choices—model version, open/closed weight status, provider, and system prompt—each alter LLM-agent intervention rates on forum posts, with closed-weight models declining more on visible challenges than open-weight models.
KARMA: Karma-Aligned Reward Model Adaptation cs.CL · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
KARMA adapts reward models from Reddit karma data to align LLMs with conversational pragmatics, finding that context-only rewards outperform karma-predictive ones downstream while reducing factuality across conditions.
Positive Alignment: Artificial Intelligence for Human Flourishing cs.AI · 2026-05-11 · unverdicted · none · ref 154 · internal anchor
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 43 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
TrustLLM: Trustworthiness in Large Language Models cs.CL · 2024-01-10 · unverdicted · none · ref 48 · internal anchor
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
Emergent alignment and the projectability of ethical personas cs.AI · 2026-06-08 · unverdicted · none · ref 34 · internal anchor
Narrow constitutional finetuning on safety sub-tasks induces emergent alignment across broader safety domains and yields projectable ethical personas whose signatures can be measured with a multidimensional diagnostic.
Distributed Interpretability and Control for Large Language Models cs.LG · 2026-04-07 · conditional · none · ref 7 · internal anchor
A distributed system for logit lens and steering vectors on multi-GPU LLMs achieves up to 7x lower activation memory and 41x higher throughput while producing monotonic output shifts with mean slope 0.702.
IACDM: Interactive Adversarial Convergence Development Methodology -- A Structured Framework for AI-Assisted Software Development cs.SE · 2026-03-31 · unverdicted · none · ref 28 · internal anchor
IACDM is an 8-phase methodology using external verification agents and three pillars to close the verification gap in stochastic LLM-based software development.
Exploring the "Banality" of Deception in Generative AI cs.HC · 2026-05-07 · unverdicted · none · ref 12 · internal anchor
Deception in generative AI is subtle and normalized through defaults and interactions, with users often complicit, calling for friction, awareness, and regulatory approaches to protect users.
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures cs.AI · 2026-04-09 · unreviewed · ref 27 · internal anchor
