Towards evaluating AI systems for moral status using self-reports

URL http://arxiv · 2025 · arXiv 2311.08576

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

support 1

representative citing papers

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

cs.CL · 2026-04-01 · unverdicted · novelty 6.0

A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

cs.CL · 2026-03-17 · unverdicted · novelty 6.0

Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.

LLM Evaluators Recognize and Favor Their Own Generations

cs.CL · 2024-04-15 · unverdicted · novelty 6.0

LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

citing papers explorer

Showing 3 of 3 citing papers.

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models cs.CL · 2026-04-01 · unverdicted · none · ref 23
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious cs.CL · 2026-03-17 · unverdicted · none · ref 7
Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.
LLM Evaluators Recognize and Favor Their Own Generations cs.CL · 2024-04-15 · unverdicted · none · ref 21
LLMs show measurable self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.

Towards evaluating AI systems for moral status using self-reports

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer