Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation
Pith reviewed 2026-05-18 06:53 UTC · model grok-4.3
The pith
Reasoning language models cut social bias by reviewing their own thinking for stereotype repetition and irrelevant details.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning-based language models aggregate social bias when their thinking process repeats stereotypes as the main justification or introduces unrelated information to support a biased narrative. Querying the model to review its initial reasoning against these two specific failure patterns produces outputs with measurably less bias on question-answering and open-ended generation tasks while preserving or improving task accuracy.
What carries the argument
A lightweight self-review prompt that directs the model to examine its first reasoning trace for stereotype repetition and irrelevant information injection.
If this is right
- The same review step can be inserted into any chain-of-thought pipeline without retraining.
- Bias reduction should appear on both closed and open-ended tasks once the review targets the identified patterns.
- Accuracy is preserved or improved because the review removes flawed justifications rather than adding constraints.
Where Pith is reading between the lines
- If the review prompt is made task-specific it may further improve results on domains where one failure pattern dominates.
- The approach could be combined with output-level debiasing methods to address bias that survives the thinking stage.
- Developers of future reasoning models might bake similar self-checks into training objectives to reduce reliance on external prompts.
Load-bearing premise
The two failure patterns are the main causes of bias inside the reasoning process and checking for them will continue to work on new tasks and models.
What would settle it
Run the self-review prompt on a fresh social-bias dataset whose dominant error mode is neither stereotype repetition nor irrelevant injection; if bias scores remain unchanged the central claim is false.
Figures
read the original abstract
While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates mechanisms in the thinking processes of reasoning-based LLMs that lead to social bias aggregation. It identifies two failure patterns—stereotype repetition (relying on stereotypes as primary justification) and irrelevant information injection (fabricating details to support biased narratives)—and proposes a lightweight prompt-based mitigation where the model reviews its initial reasoning against these specific patterns. Experiments on BBQ and StereoSet (question answering) and BOLD (open-ended) benchmarks are reported to show effective bias reduction while maintaining or improving accuracy.
Significance. If the central results hold after addressing validation gaps, the work provides a useful empirical probe into internal reasoning behaviors of LLMs on bias tasks and a practical prompt intervention. The emphasis on analyzing the thinking process itself rather than only final outputs is a constructive direction for bias mitigation research. However, the current evidence base is limited by missing controls and details, which reduces the immediate impact until strengthened.
major comments (2)
- [Experiments] The manuscript reports positive results on BBQ, StereoSet, and BOLD but provides insufficient detail on experimental setup, how the two failure patterns were systematically identified, and any statistical analysis of the bias/accuracy changes. This weakens support for the claim that the approach 'effectively reduces bias' (abstract and experiments section).
- [Method and Experiments] No ablation isolates the specific review against stereotype repetition and irrelevant information injection from a generic self-critique or fairness prompt. Without this control, it remains possible that any reflective prompting produces the observed reductions, rendering the discovery of these exact patterns non-load-bearing for the mitigation result (experiments and method sections).
minor comments (2)
- Specify the exact models tested (e.g., particular reasoning LLMs or chain-of-thought variants) and the full prompt templates used for both pattern identification and mitigation review.
- Clarify quantitative metrics for bias and accuracy on the open-ended BOLD benchmark, including any inter-annotator agreement or automated scoring details.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical rigor of our work. We address each major comment below and commit to revisions that provide additional details, controls, and analysis without altering the core claims or methodology.
read point-by-point responses
-
Referee: [Experiments] The manuscript reports positive results on BBQ, StereoSet, and BOLD but provides insufficient detail on experimental setup, how the two failure patterns were systematically identified, and any statistical analysis of the bias/accuracy changes. This weakens support for the claim that the approach 'effectively reduces bias' (abstract and experiments section).
Authors: We agree that greater transparency is needed to support the claims. In the revised manuscript, we expand the experimental setup subsection to specify the exact models used (including versions and temperature settings), full prompting templates for both baseline and mitigation conditions, dataset splits, and evaluation protocols for bias and accuracy metrics. For pattern identification, we add a dedicated paragraph describing the systematic process: a qualitative analysis of 200 randomly sampled thinking traces per benchmark, where two authors independently coded reasoning steps leading to biased outputs, achieving high inter-annotator agreement (Cohen's kappa = 0.82), resulting in the two dominant patterns. We also include statistical analysis throughout the results, reporting mean bias/accuracy changes with standard deviations, paired t-tests for significance (p < 0.05), and effect sizes in updated tables. revision: yes
-
Referee: [Method and Experiments] No ablation isolates the specific review against stereotype repetition and irrelevant information injection from a generic self-critique or fairness prompt. Without this control, it remains possible that any reflective prompting produces the observed reductions, rendering the discovery of these exact patterns non-load-bearing for the mitigation result (experiments and method sections).
Authors: This is a valid concern regarding the specificity of our intervention. While the patterns were derived directly from observed failures in the thinking traces, we recognize that an ablation would better isolate their contribution. In the revised version, we add a new ablation experiment comparing our targeted self-review prompt against (1) a generic self-critique prompt instructing the model to 'review and improve your reasoning' and (2) a general fairness prompt focused on avoiding bias without referencing the specific patterns. Results demonstrate that the pattern-specific review yields statistically larger bias reductions on all benchmarks while preserving accuracy gains, indicating that the identified failure modes are indeed load-bearing for the mitigation effectiveness. These findings are presented in a new subsection and table. revision: yes
Circularity Check
No significant circularity in empirical pattern-to-mitigation chain
full rationale
The paper identifies two failure patterns (stereotype repetition and irrelevant information injection) via systematic investigation of reasoning traces on social bias scenarios, then proposes a prompt-based review step that references exactly those patterns. Experimental results on BBQ, StereoSet, and BOLD are reported as validation. No equations, fitted parameters, or self-citations are shown to reduce the central claim to its inputs by construction; the patterns function as independently observed inputs to an empirically tested intervention rather than a self-definitional loop or renamed known result.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reasoning-based LLMs have an internal structured thinking process that can be inspected via prompting.
Forward citations
Cited by 1 Pith paper
-
More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models
Position bias scales positively with reasoning trajectory length in CoT models, shown by partial correlations and truncation interventions across multiple benchmarks and model scales.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2503.06011
Intent-aware self-correction for mitigating social biases in large language models. arXiv preprint arXiv:2503.06011. Divij Bajaj, Yuanyuan Lei, Jonathan Tong, and Ruihong Huang
-
[2]
InFindings of the As- sociation for Computational Linguistics: EMNLP 2024, pages 15804–15818
Evaluating gender bias of LLMs in making morality judgements. InFindings of the As- sociation for Computational Linguistics: EMNLP 2024, pages 15804–15818. Riccardo Cantini, Nicola Gabriele, Alessio Orsino, and Domenico Talia
work page 2024
-
[3]
arXiv preprint arXiv:2507.02799
Is reasoning all you need? probing bias in the age of reasoning language models. arXiv preprint arXiv:2507.02799. Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al
-
[4]
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Do not think that much for 2+3=? on the overthinking of o1-like LLMs.arXiv preprint arXiv:2412.21187. Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schul- man, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. 2025a. Reasoning models don’t always say what they think.arXiv preprint arXiv:2505.05410. Yongchao ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948. Fereshteh Hasanzadeh, Colin B Josephson, Gabriella Waters, Demilade Adedinsewo, Zahra Azizi, and James A White
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
InFind- ings of the Association for Computational Linguis- tics: EMNLP 2024, pages 1444–1466
Self- Explore: Enhancing mathematical reasoning in lan- guage models with fine-grained rewards. InFind- ings of the Association for Computational Linguis- tics: EMNLP 2024, pages 1444–1466. 9 Nour Jedidi, Yung-Sung Chuang, James Glass, and Jimmy Lin
work page 2024
- [7]
-
[8]
InACM Transactions on Software Engineering and Methodology
A survey on large language models for code generation. InACM Transactions on Software Engineering and Methodology. Jiho Jin, Jiseon Kim, Nayeon Lee, Haneul Yoo, Al- ice Oh, and Hwaran Lee. 2024a. KoBBQ: Korean bias benchmark for question answering.Transac- tions of the Association for Computational Linguis- tics, 12:507–524. Mingyu Jin, Qinkai Yu, Dong Sh...
work page 2024
-
[9]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- taka Matsuo, and Yusuke Iwasawa
Eval- uating gender bias in large language models via chain-of-thought prompting.arXiv preprint arXiv:2401.15585. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- taka Matsuo, and Yusuke Iwasawa
-
[10]
From System 1 to System 2: A Survey of Reasoning Large Language Models
Efficient memory management for large language model serv- ing with pagedattention. InSymposium on Operating Systems Principles, pages 611–626. Jiachun Li, Pengfei Cao, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. 2025a. Towards better chain-of-thought: A reflection on ef- fectiveness and faithfulness. InFindings of the As- so...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Deconstructing long chain-of- thought: A structured reasoning optimization frame- work for long CoT distillation.arXiv preprint arXiv:2503.16385. Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, and Se-Young Yun
-
[12]
InFindings of the Association for Computa- tional Linguistics: ACL 2025, pages 25127–25152
Self- training elicits concise reasoning in large language models. InFindings of the Association for Computa- tional Linguistics: ACL 2025, pages 25127–25152. Moin Nadeem, Anna Bethke, and Siva Reddy
work page 2025
-
[13]
InFindings of the Asso- ciation for Computational Linguistics: EACL 2024, pages 1722–1742
In-contextual gender bias suppression for large language models. InFindings of the Asso- ciation for Computational Linguistics: EACL 2024, pages 1722–1742. OpenAI
work page 2024
-
[14]
InFindings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105
BBQ: A hand-built bias benchmark for question answering. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105. Reva Schwartz, Apostol Vassilev, Kristen K. Greene, Lori Perine, Andrew Burt, and Patrick Hall
work page 2022
-
[15]
Between underthinking and overthink- ing: An empirical study of reasoning length and cor- rectness in LLMs.arXiv preprint arXiv:2505.00127. Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al
-
[16]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi k1.5: Scaling reinforcement learning with LLMs.arXiv preprint arXiv:2501.12599. Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
InAdvances in neural information processing systems, pages 24824– 24837
Chain-of-thought prompting elicits rea- soning in large language models. InAdvances in neural information processing systems, pages 24824– 24837. Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, and Yi Fang. 2025a. Does rea- soning introduce bias? A study of social bias evalua- tion and mitigation in llm reasoning.arXiv preprint arXiv:25...
-
[18]
arXiv preprint arXiv:2406.02050
JBBQ: Japanese bias benchmark for analyzing social biases in large language models. arXiv preprint arXiv:2406.02050. Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Zheng Lin, Li Cao, and Weiping Wang
-
[19]
arXiv preprint arXiv:2504.15895
Dynamic early exit in reason- ing models.arXiv preprint arXiv:2504.15895. Evelyn Yee, Alice Li, Chenyu Tang, Yeon Ho Jung, Ramamohan Paturi, and Leon Bergen
- [20]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.