SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models

Zhang, Bo, Gao, Cong, Yang, Linkang, Han, Bingxu, Hu, Minghao, Luo, Zhunchen · 2025 · DOI 10.18653/v1/2025.findings-emnlp.186

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

cs.SI · 2026-05-30 · unverdicted · novelty 4.0

ReBias-Lens shows LLM self-reflection produces layer-wise smoothing of global valence fluctuations that reduces behavioral bias overall, yet selectively locks in and amplifies certain category-specific biases.

citing papers explorer

Showing 1 of 1 citing paper.

Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

cs.SI · 2026-05-30 · unverdicted · none · ref 39
