Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Liu, Guangliang, Mao, Haitao, Tang, Jiliang, Johnson, Kristen · 2024 · DOI 10.18653/v1/2024.emnlp-main.918

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

cs.SI · 2026-05-30 · unverdicted · novelty 4.0

ReBias-Lens shows LLM self-reflection produces layer-wise smoothing of global valence fluctuations that reduces behavioral bias overall, yet selectively locks in and amplifies certain category-specific biases.

citing papers explorer

Showing 1 of 1 citing paper.

Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

cs.SI · 2026-05-30 · unverdicted · none · ref 36

ReBias-Lens shows LLM self-reflection produces layer-wise smoothing of global valence fluctuations that reduces behavioral bias overall, yet selectively locks in and amplifies certain category-specific biases.
