Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

Boci Yang; Bo Wang; Dongming Zhao; Jingshen Zhang; Ruifang He; Yuexian Hou; Zifei Yu

arxiv: 2606.00600 · v1 · pith:3CW247ZQnew · submitted 2026-05-30 · 💻 cs.SI

Understanding the Self-Reflection Mechanisms of LLMs through Biased Attitude Associations

Jingshen Zhang , Bo Wang , Boci Yang , Dongming Zhao , Ruifang He , Yuexian Hou , Zifei Yu This is my paper

Pith reviewed 2026-06-28 18:15 UTC · model grok-4.3

classification 💻 cs.SI

keywords self-reflectionLLMssocial biasvalence fluctuationReBias-Lensbias mitigationattitude associationshierarchical divergence

0 comments

The pith

LLM self-reflection smooths overall valence fluctuations across layers while selectively amplifying some category biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ReBias-Lens to investigate how self-reflection in LLMs reconfigures biased attitude associations using valence projections. It tracks Global-VF for broad trends and Local-VF for specific categories in four models over twelve groups. Results indicate layer-wise smoothing and hierarchical divergence that reduces bias in final outputs. This mitigation is not complete, as certain categories show locked-in or increased bias levels. Understanding these mechanics matters for developing more reliable bias reduction in AI systems.

Core claim

The central discovery is that self-reflection mechanisms produce a distinct layer-wise smoothing of valence fluctuations with significant hierarchical representation divergence as layers deepen. This leads to widespread mitigation of bias at the behavioral level but with a stubborn category-specific selectivity that can lock in and amplify localized biases.

What carries the argument

ReBias-Lens, a probing framework that uses Valence Fluctuation (VF) with Global-VF and Local-VF variants to measure changes in biased attitude associations during self-reflection.

If this is right

Overall valence fluctuations smooth out in deeper layers of the model.
Behavioral outputs exhibit reduced bias after the self-reflection process.
Some social categories experience persistent or amplified bias despite the macro-level reduction.
The reconfiguration occurs through valence projection in intersectional contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interventions could focus on specific layers to address the selective amplification.
The findings may apply to designing self-reflection prompts that target category-specific biases.
Similar mechanisms might appear in other internal processes beyond self-reflection.

Load-bearing premise

Social bias is intrinsically encoded as valence inclinations that worsen with sharper fluctuations across social groups.

What would settle it

If measurements with ReBias-Lens show no layer-wise smoothing of valence fluctuations or absence of category-specific selectivity in bias amplification during self-reflection.

Figures

Figures reproduced from arXiv: 2606.00600 by Boci Yang, Bo Wang, Dongming Zhao, Jingshen Zhang, Ruifang He, Yuexian Hou, Zifei Yu.

**Figure 2.** Figure 2: Comparative Analysis of Global-VF Trajec [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 4.** Figure 4: Local-VF compared with different settings. Blue and green box plots represent self-reflection states without [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Percentage of bias mitigation across 12 so [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Measuring the Shift from Initial Option B to [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 9.** Figure 9: Mistral-7B-Instruct-v0.3, Projection-Based and Cosine Similarity Comparison [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Qwen2.5-7B-Instruct, Projection-Based and Cosine Similarity Comparison poverty, ugly, cancer, kill, rotten, vomit, agony, prison F Self-Reflection Instructions 12 [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

**Figure 11.** Figure 11: Self-reflection instructions 13 [PITH_FULL_IMAGE:figures/full_fig_p013_11.png] view at source ↗

read the original abstract

While the emergent self-reflection capabilities of Large Language Models (LLMs) offer a promising paradigm for autonomous bias mitigation, their internal mechanics remain unclear, raising concerns regarding potential bias entrenchment. Under the premise that social bias is intrinsically encoded as valence inclinations, where the exacerbation of bias scales with sharper valence fluctuations across social groups, this paper proposes ReBias-Lens, a probing framework designed to interpret how self-reflection reconfigures these biased attitude associations through the lens of valence projection within intersectional contexts. Central to ReBias-Lens is the metric of Valence Fluctuation (VF) comprising two variants: Global-VF, which captures macroscopic valence encoding trends, and Local-VF, which scrutinizes microscopic distinctiveness across specific social categories. Deploying ReBias-Lens to evaluate four LLMs across twelve social categories reveals that overall valence fluctuations undergo a distinct layer-wise smoothing, characterized by a significant hierarchical representation divergence as the layers deepen, which ultimately manifests as a widespread mitigation of bias at the behavioral level. In stark contrast to this macro-level reduction, this reflection mechanism is not universally corrective, instead exhibiting a stubborn, category-specific selectivity that regularly locks in and perversely amplifies localized biases. Warning: this paper contains examples with biased content.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ReBias-Lens splits valence effects into global smoothing and local amplification under self-reflection, but the whole reading rests on an untested premise that sharper fluctuations equal greater bias.

read the letter

The main thing to know is that this paper builds ReBias-Lens to track how self-reflection changes valence associations inside LLMs, reporting that overall fluctuations smooth across layers for a behavioral bias drop while some categories show locked-in or increased bias.

What is new is the explicit split into Global-VF for broad trends and Local-VF for category-specific patterns, applied to self-reflection layers in intersectional settings across four models and twelve categories. The layer-wise hierarchical divergence observation extends existing interpretability work on representation changes.

The paper earns credit for laying out a concrete framework and for surfacing the mixed pattern instead of claiming uniform improvement. That distinction between macro reduction and selective entrenchment could matter for safety work that tries to use reflection as a mitigation step.

The soft spot is the opening premise that social bias is encoded as valence inclinations and that exacerbation scales with sharper fluctuations across groups. The abstract states this directly and then reads the smoothing as mitigation and the selectivity as amplification on that basis. No independent validation or external check is referenced to show the valence-to-bias mapping holds or works uniformly. If the link is weaker or category-dependent, the claims about self-reflection mechanisms do not follow. The VF metrics are also defined from the same associations later used to claim effects, which raises a circularity risk.

Methods, datasets, statistical tests, and raw numbers are not visible in the abstract, so it is impossible to judge whether the reported patterns are robust or sensitive to metric choices.

This is for readers working on LLM interpretability or bias diagnostics in safety contexts. Someone looking for new probing tools might try the Global/Local split even if the interpretation needs tightening.

It deserves a serious referee because it supplies a new diagnostic framework with an empirical pattern, even if the premise and validation will need work. Reviewers can test whether the evidence actually supports the mechanism claims.

Referee Report

2 major / 0 minor

Summary. The paper proposes the ReBias-Lens probing framework to examine how self-reflection in LLMs reconfigures biased attitude associations via valence projection in intersectional contexts. Under the explicit premise that social bias is encoded as valence inclinations (with bias exacerbation scaling with sharper valence fluctuations across groups), it introduces Global-VF and Local-VF metrics. Analysis across four LLMs and twelve social categories finds layer-wise smoothing of overall valence fluctuations (with hierarchical representation divergence) that manifests as behavioral-level bias mitigation, contrasted by category-specific selectivity that can lock in or amplify localized biases.

Significance. If the valence-to-bias mapping holds and the VF metrics provide independent evidence, the work could illuminate internal self-reflection dynamics in LLMs and highlight limits of uniform bias mitigation. The layer-wise analysis and distinction between macro mitigation and micro amplification would be a useful contribution to interpretability research in social informatics, particularly if accompanied by falsifiable predictions or reproducible code.

major comments (2)

[Abstract] Abstract: The premise that 'social bias is intrinsically encoded as valence inclinations, where the exacerbation of bias scales with sharper valence fluctuations across social groups' is stated without cited prior validation, independent empirical tests, or sensitivity analysis. This assumption is load-bearing for the central claim, as ReBias-Lens, Global-VF, and Local-VF are constructed around it; without it, the interpretation of layer-wise smoothing as 'widespread mitigation of bias' and category selectivity as 'perversely amplifies localized biases' cannot be read as evidence about self-reflection mechanisms.
[Abstract] Abstract (and framework definition): The Valence Fluctuation metric is defined directly in terms of the same valence associations later used to quantify bias mitigation/amplification. This creates a circularity risk for Global-VF and Local-VF; the reported 'distinct layer-wise smoothing' and 'hierarchical representation divergence' may largely re-express the input associations rather than provide an independent measurement of bias reconfiguration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments highlight important issues regarding the grounding of our core premise and the independence of the VF metrics. We address each point below and outline revisions to strengthen the paper.

read point-by-point responses

Referee: [Abstract] Abstract: The premise that 'social bias is intrinsically encoded as valence inclinations, where the exacerbation of bias scales with sharper valence fluctuations across social groups' is stated without cited prior validation, independent empirical tests, or sensitivity analysis. This assumption is load-bearing for the central claim, as ReBias-Lens, Global-VF, and Local-VF are constructed around it; without it, the interpretation of layer-wise smoothing as 'widespread mitigation of bias' and category selectivity as 'perversely amplifies localized biases' cannot be read as evidence about self-reflection mechanisms.

Authors: We acknowledge that the premise is presented without explicit citations or sensitivity analysis in the submitted version, making it a load-bearing assumption. The full manuscript draws on concepts from affective computing and implicit bias literature, but these connections were not sufficiently documented. In revision, we will add targeted citations to prior work linking valence fluctuations to bias exacerbation and include a sensitivity analysis varying the fluctuation thresholds and valence sources to test robustness of the layer-wise smoothing and selectivity findings. revision: yes
Referee: [Abstract] Abstract (and framework definition): The Valence Fluctuation metric is defined directly in terms of the same valence associations later used to quantify bias mitigation/amplification. This creates a circularity risk for Global-VF and Local-VF; the reported 'distinct layer-wise smoothing' and 'hierarchical representation divergence' may largely re-express the input associations rather than provide an independent measurement of bias reconfiguration.

Authors: The VF metrics are computed on the evolution of valence projections across layers, using fixed external valence associations as the input baseline; the reported smoothing and divergence reflect changes relative to that baseline during self-reflection. This is not intended as a fully independent measure but as a lens on reconfiguration dynamics. To address the circularity concern, we will revise the methods section for clearer separation of input associations from layer outputs and add an ablation comparing VF trends against direct behavioral bias scores from the same models. revision: partial

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central premise that bias equals valence fluctuation is treated as given rather than derived.

axioms (1)

domain assumption Social bias is intrinsically encoded as valence inclinations, where the exacerbation of bias scales with sharper valence fluctuations across social groups.
Explicitly stated as the premise of the paper in the abstract.

invented entities (2)

ReBias-Lens framework no independent evidence
purpose: Probing tool to interpret self-reflection via valence projection in intersectional contexts.
Newly proposed metric and analysis pipeline introduced in the abstract.
Global-VF and Local-VF metrics no independent evidence
purpose: Quantify macroscopic valence encoding trends and microscopic distinctiveness across social categories.
Core measurement constructs defined for the study.

pith-pipeline@v0.9.1-grok · 5771 in / 1461 out tokens · 19718 ms · 2026-06-28T18:15:11.167889+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

78 extracted references · 14 canonical work pages

[1]

2024 , eprint=

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis , author=. 2024 , eprint=

2024
[2]

2025 , eprint=

Self-correction is Not An Innate Capability in Large Language Models , author=. 2025 , eprint=

2025
[3]

When Can LLM s Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLM s

Kamoi, Ryo and Zhang, Yusen and Zhang, Nan and Han, Jiawei and Zhang, Rui. When Can LLM s Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLM s. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00713

work page doi:10.1162/tacl_a_00713 2024
[4]

2024 , eprint=

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs , author=. 2024 , eprint=

2024
[5]

2024 , eprint=

Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions , author=. 2024 , eprint=

2024
[6]

Bryson and Arvind Narayanan , title =

Caliskan, Aylin and Bryson, Joanna J. and Narayanan, Arvind , year=. Semantics derived automatically from language corpora contain human-like biases , volume=. Science , publisher=. doi:10.1126/science.aal4230 , number=

work page doi:10.1126/science.aal4230
[7]

2019 , eprint=

On Measuring Social Biases in Sentence Encoders , author=. 2019 , eprint=

2019
[8]

Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases , url=

Guo, Wei and Caliskan, Aylin , year=. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases , url=. doi:10.1145/3461702.3462536 , booktitle=

work page doi:10.1145/3461702.3462536
[9]

Data Sci

Tommaso Dolci and Fabio Azzalini and Mara Tanelli , title =. Data Sci. Eng. , volume =. 2023 , url =. doi:10.1007/S41019-023-00211-0 , timestamp =

work page doi:10.1007/s41019-023-00211-0 2023
[10]

Proceedings of The Web Conference 2020 , pages =

The Polar Framework: Polar Opposites Enable Interpretability of Pretrained Word Embeddings , author =. Proceedings of The Web Conference 2020 , pages =

2020
[11]

Findings of the Association for Computational Linguistics: EMNLP 2022 , pages =

SensePOLAR: Word Sense Aware Interpretability for Pre-trained Contextual Word Embeddings , author =. Findings of the Association for Computational Linguistics: EMNLP 2022 , pages =

2022
[12]

2024 , eprint=

Anchoring Bias in Large Language Models: An Experimental Study , author=. 2024 , eprint=

2024
[13]

2024 , eprint=

Bias Amplification in Language Model Evolution: An Iterated Learning Perspective , author=. 2024 , eprint=

2024
[14]

2025 , eprint=

Understanding the Dark Side of LLMs' Intrinsic Self-Correction , author=. 2025 , eprint=

2025
[15]

Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society , pages =

Evaluating Biased Attitude Associations of Language Models in an Intersectional Context , author =. Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society , pages =

2023
[16]

1986 , publisher =

Blackboard Systems , editor =. 1986 , publisher =

1986
[17]

Proceedings of the Eighth International Joint Conference on Artificial Intelligence , pages =

Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education , author =. Proceedings of the Eighth International Joint Conference on Artificial Intelligence , pages =. 1983 , publisher =

1983
[18]

Proceedings of the Fourth National Conference on Artificial Intelligence , pages =

Classification Problem Solving , author =. Proceedings of the Fourth National Conference on Artificial Intelligence , pages =. 1984 , publisher =

1984
[19]

Science , volume =

New Ways to Make Microcircuits Smaller , author =. Science , volume =. 1980 , doi =

1980
[20]

International Journal of Man-Machine Studies , volume =

Strategic Explanations for a Diagnostic Consultation System , author =. International Journal of Man-Machine Studies , volume =. 1984 , doi =

1984
[21]

1986 , number =

Poligon: A System for Parallel Problem Solving , author =. 1986 , number =

1986
[22]

1979 , school =

Transfer of Rule-Based Expertise through a Tutorial Dialogue , author =. 1979 , school =

1979
[23]

2021 , note =

The Engineering of Qualitative Models , author =. 2021 , note =

2021
[24]

arXiv preprint arXiv:1706.03762 , year =

Attention Is All You Need , author =. arXiv preprint arXiv:1706.03762 , year =

Pith/arXiv arXiv
[25]

2015 , howpublished =

Pluto: The 'Other' Red Planet , author =. 2015 , howpublished =

2015
[26]

ArXiv , year=

Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions , author=. ArXiv , year=
[27]

2024 , url=

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs , author=. 2024 , url=

2024
[28]

Osgood , journal =

Charles E. Osgood , journal =. Semantic Differential Technique in the Comparative Study of Cultures , urldate =
[29]

1981 , publisher=

The Principles of Psychology , author=. 1981 , publisher=

1981
[30]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024
[31]

arXiv preprint arXiv:2405.20947 , year=

Or-bench: An over-refusal benchmark for large language models , author=. arXiv preprint arXiv:2405.20947 , year=

arXiv
[32]

2021 , eprint=

ValNorm Quantifies Semantics to Reveal Consistent Valence Biases Across Languages and Over Centuries , author=. 2021 , eprint=

2021
[33]

Explicit vs

Zhao, Yachao and Wang, Bo and Wang, Yan and Zhao, Dongming and He, Ruifang and Hou, Yuexian. Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection. Findings of the Association for Computational Linguistics: ACL 2025. 2025

2025
[34]

2024 , eprint=

Measuring Implicit Bias in Explicitly Unbiased Large Language Models , author=. 2024 , eprint=

2024
[35]

A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation

Zhao, Yachao and Wang, Bo and Wang, Yan and Zhao, Dongming and Jin, Xiaojia and Zhang, Jijun and He, Ruifang and Hou, Yuexian. A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-...

2024
[36]

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Liu, Guangliang and Mao, Haitao and Tang, Jiliang and Johnson, Kristen. Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.918

work page doi:10.18653/v1/2024.emnlp-main.918 2024
[37]

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , url=

Renze, Matthew and Guven, Erhan , year=. The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , url=. doi:10.1109/fllm63129.2024.10852493 , booktitle=

work page doi:10.1109/fllm63129.2024.10852493 2024
[38]

ArXiv , year=

Merging Improves Self-Critique Against Jailbreak Attacks , author=. ArXiv , year=
[39]

S afe C onf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models

Zhang, Bo and Gao, Cong and Yang, Linkang and Han, Bingxu and Hu, Minghao and Luo, Zhunchen and Geng, Guotong and Bai, Xiaoying and Zhang, Jun and Yao, Wen and Wang, Zhong. S afe C onf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/...

work page doi:10.18653/v1/2025.findings-emnlp.186 2025
[40]

2024 , eprint=

Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement , author=. 2024 , eprint=

2024
[41]

Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Phan, Hoang and Li, Victor and Lei, Qi. Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.503

work page doi:10.18653/v1/2025.findings-emnlp.503 2025
[42]

Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study , url =

Zhao, Lili and Wang, Yang and Liu, Qi and Wang, Mengyun and Chen, Wei and Sheng, Zhichao and Wang, Shijin , booktitle =. Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study , url =
[43]

Towards Mitigating LLM Hallucination via Self Reflection

Ji, Ziwei and Yu, Tiezheng and Xu, Yan and Lee, Nayeon and Ishii, Etsuko and Fung, Pascale. Towards Mitigating LLM Hallucination via Self Reflection. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.123

work page doi:10.18653/v1/2023.findings-emnlp.123 2023
[44]

2023 , eprint=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. 2023 , eprint=

2023
[45]

2024 , url=

Self-correction is Not An Innate Capability in Large Language Models: A Case Study of Moral Self-correction , author=. 2024 , url=

2024
[47]

Psychological Review , volume =

Implicit social cognition: attitudes, self-esteem, and stereotypes , author =. Psychological Review , volume =. 1995 , month = jan, publisher =

1995
[48]

2022 , eprint=

VAST: The Valence-Assessing Semantics Test for Contextualizing Language Models , author=. 2022 , eprint=

2022
[49]

Journal of Personality and Social Psychology , volume =

Measuring individual differences in implicit cognition: the implicit association test , author =. Journal of Personality and Social Psychology , volume =. 1998 , publisher =

1998
[50]

2014 , doi =

Smith, Joanna and Noble, Helen , title =. 2014 , doi =. https://ebn.bmj.com/content/17/4/100.full.pdf , journal =

2014
[51]

2025 , eprint=

Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral , author=. 2025 , eprint=

2025
[52]

The American Journal of Psychology , volume=

An atlas of semantic profiles for 360 words , author=. The American Journal of Psychology , volume=. 1958 , publisher=

1958
[53]

and Taddy, Matt and Evans, James A

Kozlowski, Austin C. and Taddy, Matt and Evans, James A. , year=. The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings , volume=. American Sociological Review , publisher=. doi:10.1177/0003122419877135 , number=

work page doi:10.1177/0003122419877135
[54]

Evidence-Based Nursing , volume =

Bias in research , author =. Evidence-Based Nursing , volume =. 2014 , publisher =

2014
[55]

2023 , eprint=

The Capacity for Moral Self-Correction in Large Language Models , author=. 2023 , eprint=

2023
[56]

2025 , eprint=

Safety Layers in Aligned Large Language Models: The Key to LLM Security , author=. 2025 , eprint=

2025
[57]

ArXiv , year=

HuggingFace's Transformers: State-of-the-art Natural Language Processing , author=. ArXiv , year=
[58]

2024 , eprint=

How do Large Language Models Handle Multilingualism? , author=. 2024 , eprint=

2024
[59]

2011 , note =

Dual-process theories and cognitive development: Advances and challenges , journal =. 2011 , note =. doi:https://doi.org/10.1016/j.dr.2011.07.002 , url =

work page doi:10.1016/j.dr.2011.07.002 2011
[60]

2026 , eprint=

Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin , author=. 2026 , eprint=

2026
[61]

Behavior Research Methods, Instruments, & Computers , volume =

Words high and low in pleasantness as rated by male and female college students , author =. Behavior Research Methods, Instruments, & Computers , volume =. 1986 , month = may, publisher =

1986
[62]

ArXiv , year=

LLaMA: Open and Efficient Foundation Language Models , author=. ArXiv , year=
[63]

ArXiv , year=

Mistral 7B , author=. ArXiv , year=
[64]

ArXiv , year=

Qwen Technical Report , author=. ArXiv , year=
[65]

ArXiv , year=

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. ArXiv , year=
[66]

Gender bias and stereotypes in Large Language Models , url=

Kotek, Hadas and Dockum, Rikker and Sun, David , year=. Gender bias and stereotypes in Large Language Models , url=. doi:10.1145/3582269.3615599 , booktitle=

work page doi:10.1145/3582269.3615599
[67]

Nature , volume=

The Principles of Psychology , author=. Nature , volume=
[68]

ArXiv , year=

Implicit Bias in LLMs: A Survey , author=. ArXiv , year=
[69]

and Aponte, Ryan and Rossi, Ryan A

Gallegos, Isabel O. and Aponte, Ryan and Rossi, Ryan A. and Barrow, Joe and Tanjim, Mehrab and Yu, Tong and Deilamsalehy, Hanieh and Zhang, Ruiyi and Kim, Sungchul and Dernoncourt, Franck and Lipka, Nedim and Owens, Deonna and Gu, Jiuxiang. Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes. Proceedings of the 2025 Co...

work page doi:10.18653/v1/2025.naacl-short.74 2025
[70]

2025 , eprint=

Exploring the Impact of Personality Traits on LLM Bias and Toxicity , author=. 2025 , eprint=

2025
[71]

2024 , eprint=

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models , author=. 2024 , eprint=

2024
[72]

2022 , eprint=

A New Generation of Perspective API: Efficient Multilingual Character-level Transformers , author=. 2022 , eprint=

2022
[73]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[74]

Publications Manual , year = "1983", publisher =

1983
[75]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[76]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[77]

Dan Gusfield , title =. 1997

1997
[78]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[79]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[1] [1]

2024 , eprint=

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis , author=. 2024 , eprint=

2024

[2] [2]

2025 , eprint=

Self-correction is Not An Innate Capability in Large Language Models , author=. 2025 , eprint=

2025

[3] [3]

When Can LLM s Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLM s

Kamoi, Ryo and Zhang, Yusen and Zhang, Nan and Han, Jiawei and Zhang, Rui. When Can LLM s Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLM s. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00713

work page doi:10.1162/tacl_a_00713 2024

[4] [4]

2024 , eprint=

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs , author=. 2024 , eprint=

2024

[5] [5]

2024 , eprint=

Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions , author=. 2024 , eprint=

2024

[6] [6]

Bryson and Arvind Narayanan , title =

Caliskan, Aylin and Bryson, Joanna J. and Narayanan, Arvind , year=. Semantics derived automatically from language corpora contain human-like biases , volume=. Science , publisher=. doi:10.1126/science.aal4230 , number=

work page doi:10.1126/science.aal4230

[7] [7]

2019 , eprint=

On Measuring Social Biases in Sentence Encoders , author=. 2019 , eprint=

2019

[8] [8]

Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases , url=

Guo, Wei and Caliskan, Aylin , year=. Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases , url=. doi:10.1145/3461702.3462536 , booktitle=

work page doi:10.1145/3461702.3462536

[9] [9]

Data Sci

Tommaso Dolci and Fabio Azzalini and Mara Tanelli , title =. Data Sci. Eng. , volume =. 2023 , url =. doi:10.1007/S41019-023-00211-0 , timestamp =

work page doi:10.1007/s41019-023-00211-0 2023

[10] [10]

Proceedings of The Web Conference 2020 , pages =

The Polar Framework: Polar Opposites Enable Interpretability of Pretrained Word Embeddings , author =. Proceedings of The Web Conference 2020 , pages =

2020

[11] [11]

Findings of the Association for Computational Linguistics: EMNLP 2022 , pages =

SensePOLAR: Word Sense Aware Interpretability for Pre-trained Contextual Word Embeddings , author =. Findings of the Association for Computational Linguistics: EMNLP 2022 , pages =

2022

[12] [12]

2024 , eprint=

Anchoring Bias in Large Language Models: An Experimental Study , author=. 2024 , eprint=

2024

[13] [13]

2024 , eprint=

Bias Amplification in Language Model Evolution: An Iterated Learning Perspective , author=. 2024 , eprint=

2024

[14] [14]

2025 , eprint=

Understanding the Dark Side of LLMs' Intrinsic Self-Correction , author=. 2025 , eprint=

2025

[15] [15]

Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society , pages =

Evaluating Biased Attitude Associations of Language Models in an Intersectional Context , author =. Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society , pages =

2023

[16] [16]

1986 , publisher =

Blackboard Systems , editor =. 1986 , publisher =

1986

[17] [17]

Proceedings of the Eighth International Joint Conference on Artificial Intelligence , pages =

Communication, Simulation, and Intelligent Agents: Implications of Personal Intelligent Machines for Medical Education , author =. Proceedings of the Eighth International Joint Conference on Artificial Intelligence , pages =. 1983 , publisher =

1983

[18] [18]

Proceedings of the Fourth National Conference on Artificial Intelligence , pages =

Classification Problem Solving , author =. Proceedings of the Fourth National Conference on Artificial Intelligence , pages =. 1984 , publisher =

1984

[19] [19]

Science , volume =

New Ways to Make Microcircuits Smaller , author =. Science , volume =. 1980 , doi =

1980

[20] [20]

International Journal of Man-Machine Studies , volume =

Strategic Explanations for a Diagnostic Consultation System , author =. International Journal of Man-Machine Studies , volume =. 1984 , doi =

1984

[21] [21]

1986 , number =

Poligon: A System for Parallel Problem Solving , author =. 1986 , number =

1986

[22] [22]

1979 , school =

Transfer of Rule-Based Expertise through a Tutorial Dialogue , author =. 1979 , school =

1979

[23] [23]

2021 , note =

The Engineering of Qualitative Models , author =. 2021 , note =

2021

[24] [24]

arXiv preprint arXiv:1706.03762 , year =

Attention Is All You Need , author =. arXiv preprint arXiv:1706.03762 , year =

Pith/arXiv arXiv

[25] [25]

2015 , howpublished =

Pluto: The 'Other' Red Planet , author =. 2015 , howpublished =

2015

[26] [26]

ArXiv , year=

Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions , author=. ArXiv , year=

[27] [27]

2024 , url=

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs , author=. 2024 , url=

2024

[28] [28]

Osgood , journal =

Charles E. Osgood , journal =. Semantic Differential Technique in the Comparative Study of Cultures , urldate =

[29] [29]

1981 , publisher=

The Principles of Psychology , author=. 1981 , publisher=

1981

[30] [30]

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2024

[31] [31]

arXiv preprint arXiv:2405.20947 , year=

Or-bench: An over-refusal benchmark for large language models , author=. arXiv preprint arXiv:2405.20947 , year=

arXiv

[32] [32]

2021 , eprint=

ValNorm Quantifies Semantics to Reveal Consistent Valence Biases Across Languages and Over Centuries , author=. 2021 , eprint=

2021

[33] [33]

Explicit vs

Zhao, Yachao and Wang, Bo and Wang, Yan and Zhao, Dongming and He, Ruifang and Hou, Yuexian. Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection. Findings of the Association for Computational Linguistics: ACL 2025. 2025

2025

[34] [34]

2024 , eprint=

Measuring Implicit Bias in Explicitly Unbiased Large Language Models , author=. 2024 , eprint=

2024

[35] [35]

A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation

Zhao, Yachao and Wang, Bo and Wang, Yan and Zhao, Dongming and Jin, Xiaojia and Zhang, Jijun and He, Ruifang and Hou, Yuexian. A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-...

2024

[36] [36]

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

Liu, Guangliang and Mao, Haitao and Tang, Jiliang and Johnson, Kristen. Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.918

work page doi:10.18653/v1/2024.emnlp-main.918 2024

[37] [37]

The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , url=

Renze, Matthew and Guven, Erhan , year=. The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models , url=. doi:10.1109/fllm63129.2024.10852493 , booktitle=

work page doi:10.1109/fllm63129.2024.10852493 2024

[38] [38]

ArXiv , year=

Merging Improves Self-Critique Against Jailbreak Attacks , author=. ArXiv , year=

[39] [39]

S afe C onf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models

Zhang, Bo and Gao, Cong and Yang, Linkang and Han, Bingxu and Hu, Minghao and Luo, Zhunchen and Geng, Guotong and Bai, Xiaoying and Zhang, Jun and Yao, Wen and Wang, Zhong. S afe C onf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/...

work page doi:10.18653/v1/2025.findings-emnlp.186 2025

[40] [40]

2024 , eprint=

Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement , author=. 2024 , eprint=

2024

[41] [41]

Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

Phan, Hoang and Li, Victor and Lei, Qi. Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.503

work page doi:10.18653/v1/2025.findings-emnlp.503 2025

[42] [42]

Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study , url =

Zhao, Lili and Wang, Yang and Liu, Qi and Wang, Mengyun and Chen, Wei and Sheng, Zhichao and Wang, Shijin , booktitle =. Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study , url =

[43] [43]

Towards Mitigating LLM Hallucination via Self Reflection

Ji, Ziwei and Yu, Tiezheng and Xu, Yan and Lee, Nayeon and Ishii, Etsuko and Fung, Pascale. Towards Mitigating LLM Hallucination via Self Reflection. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.123

work page doi:10.18653/v1/2023.findings-emnlp.123 2023

[44] [44]

2023 , eprint=

Reflexion: Language Agents with Verbal Reinforcement Learning , author=. 2023 , eprint=

2023

[45] [45]

2024 , url=

Self-correction is Not An Innate Capability in Large Language Models: A Case Study of Moral Self-correction , author=. 2024 , url=

2024

[46] [47]

Psychological Review , volume =

Implicit social cognition: attitudes, self-esteem, and stereotypes , author =. Psychological Review , volume =. 1995 , month = jan, publisher =

1995

[47] [48]

2022 , eprint=

VAST: The Valence-Assessing Semantics Test for Contextualizing Language Models , author=. 2022 , eprint=

2022

[48] [49]

Journal of Personality and Social Psychology , volume =

Measuring individual differences in implicit cognition: the implicit association test , author =. Journal of Personality and Social Psychology , volume =. 1998 , publisher =

1998

[49] [50]

2014 , doi =

Smith, Joanna and Noble, Helen , title =. 2014 , doi =. https://ebn.bmj.com/content/17/4/100.full.pdf , journal =

2014

[50] [51]

2025 , eprint=

Self-Reflection Makes Large Language Models Safer, Less Biased, and Ideologically Neutral , author=. 2025 , eprint=

2025

[51] [52]

The American Journal of Psychology , volume=

An atlas of semantic profiles for 360 words , author=. The American Journal of Psychology , volume=. 1958 , publisher=

1958

[52] [53]

and Taddy, Matt and Evans, James A

Kozlowski, Austin C. and Taddy, Matt and Evans, James A. , year=. The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings , volume=. American Sociological Review , publisher=. doi:10.1177/0003122419877135 , number=

work page doi:10.1177/0003122419877135

[53] [54]

Evidence-Based Nursing , volume =

Bias in research , author =. Evidence-Based Nursing , volume =. 2014 , publisher =

2014

[54] [55]

2023 , eprint=

The Capacity for Moral Self-Correction in Large Language Models , author=. 2023 , eprint=

2023

[55] [56]

2025 , eprint=

Safety Layers in Aligned Large Language Models: The Key to LLM Security , author=. 2025 , eprint=

2025

[56] [57]

ArXiv , year=

HuggingFace's Transformers: State-of-the-art Natural Language Processing , author=. ArXiv , year=

[57] [58]

2024 , eprint=

How do Large Language Models Handle Multilingualism? , author=. 2024 , eprint=

2024

[58] [59]

2011 , note =

Dual-process theories and cognitive development: Advances and challenges , journal =. 2011 , note =. doi:https://doi.org/10.1016/j.dr.2011.07.002 , url =

work page doi:10.1016/j.dr.2011.07.002 2011

[59] [60]

2026 , eprint=

Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin , author=. 2026 , eprint=

2026

[60] [61]

Behavior Research Methods, Instruments, & Computers , volume =

Words high and low in pleasantness as rated by male and female college students , author =. Behavior Research Methods, Instruments, & Computers , volume =. 1986 , month = may, publisher =

1986

[61] [62]

ArXiv , year=

LLaMA: Open and Efficient Foundation Language Models , author=. ArXiv , year=

[62] [63]

ArXiv , year=

Mistral 7B , author=. ArXiv , year=

[63] [64]

ArXiv , year=

Qwen Technical Report , author=. ArXiv , year=

[64] [65]

ArXiv , year=

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. ArXiv , year=

[65] [66]

Gender bias and stereotypes in Large Language Models , url=

Kotek, Hadas and Dockum, Rikker and Sun, David , year=. Gender bias and stereotypes in Large Language Models , url=. doi:10.1145/3582269.3615599 , booktitle=

work page doi:10.1145/3582269.3615599

[66] [67]

Nature , volume=

The Principles of Psychology , author=. Nature , volume=

[67] [68]

ArXiv , year=

Implicit Bias in LLMs: A Survey , author=. ArXiv , year=

[68] [69]

and Aponte, Ryan and Rossi, Ryan A

Gallegos, Isabel O. and Aponte, Ryan and Rossi, Ryan A. and Barrow, Joe and Tanjim, Mehrab and Yu, Tong and Deilamsalehy, Hanieh and Zhang, Ruiyi and Kim, Sungchul and Dernoncourt, Franck and Lipka, Nedim and Owens, Deonna and Gu, Jiuxiang. Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes. Proceedings of the 2025 Co...

work page doi:10.18653/v1/2025.naacl-short.74 2025

[69] [70]

2025 , eprint=

Exploring the Impact of Personality Traits on LLM Bias and Toxicity , author=. 2025 , eprint=

2025

[70] [71]

2024 , eprint=

Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models , author=. 2024 , eprint=

2024

[71] [72]

2022 , eprint=

A New Generation of Perspective API: Efficient Multilingual Character-level Transformers , author=. 2022 , eprint=

2022

[72] [73]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[73] [74]

Publications Manual , year = "1983", publisher =

1983

[74] [75]

and Kozen, Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[75] [76]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[76] [77]

Dan Gusfield , title =. 1997

1997

[77] [78]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[78] [79]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =