Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

Clayton Scott; Joan Nwatu; Naihao Deng; Rada Mihalcea; Yilun Zhu

arXiv: 2606.30989 · v1 · pith:JXQS6IUInew · submitted 2026-06-30 · 💻 cs.CL · cs.AI


Pith reviewed 2026-07-01 01:10 UTC · model grok-4.3


keywords: deductive stereotyping · fairness in LLMs · reasoning-time injection · Fair-GCG · bias mitigation · chain-of-thought fairness · population-level inference


The pith

LLMs apply population statistics to single cases in a failure mode called deductive stereotyping, and phrases found by Fair-GCG steer them toward fairness-aware reasoning at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies deductive stereotyping as the pattern in which large language models draw logically coherent but socially biased conclusions by treating group-level regularities as true for every individual. It supplies a statistical account of why this pattern survives even when chain-of-thought reasoning is used. To counteract it, the authors introduce a reasoning-time injection framework that inserts short learned phrases before the model’s final answer. Fair-GCG, a search procedure, systematically locates phrases that raise scores on multiple fairness benchmarks while preserving performance on ordinary tasks. The same phrases transfer from smaller to larger models, reduce bias in open-ended text, and improve outcomes on real-world fairness-sensitive decisions.

Core claim

Deductive stereotyping is the tendency of LLMs to substitute population-level statistical associations for case-specific reasoning, yielding inferences that are internally consistent yet socially biased. A reasoning-time injection framework counters this by prepending short phrases discovered through Fair-GCG; these phrases measurably raise fairness metrics, generalize across model scales, improve reasoning-level fairness, lower bias in free-form generation, and carry over to downstream fairness-sensitive applications.
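
The gap between a population-level regularity and a case-specific inference can be made concrete with a small Bayes-rule calculation. The sketch below is an illustration only and is not the paper's statistical interpretation; the numbers and the names `posterior`, `base_rate`, and `likelihood_ratio` are hypothetical.

```python
def posterior(base_rate: float, likelihood_ratio: float) -> float:
    """Update a group-level base rate with case-specific evidence.

    likelihood_ratio = P(evidence | trait) / P(evidence | no trait).
    """
    prior_odds = base_rate / (1.0 - base_rate)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical number: 70% of some group has a given trait.
base_rate = 0.70

# Deductive stereotyping: the base rate is used as a universal premise,
# so the individual is simply assumed to have the trait.
stereotyped_conclusion = base_rate >= 0.5

# Case-specific reasoning: the evidence about this individual is ten times
# more likely if they do NOT have the trait, which should dominate the prior.
case_probability = posterior(base_rate, likelihood_ratio=0.1)
```

Under these made-up numbers the stereotyped shortcut asserts the trait, while the case-level probability falls well below one half. The point is only that a coherent deduction from a group premise can disagree with the evidence about the individual.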

What carries the argument

The reasoning-time injection framework, which prepends short learned phrases before the model produces its final answer, together with Fair-GCG, the search method that identifies the phrases.
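
As a rough picture of what injecting a phrase before the final answer could look like at the text level, consider the minimal sketch below. It assumes a hypothetical `generate(text, stop)` callable standing in for a model call, a hypothetical `"Final answer:"` marker, and a placeholder phrase borrowed from the paper's title; none of these is reported as the authors' implementation or as a phrase Fair-GCG discovered.

```python
from typing import Callable, Optional

# Hypothetical signature: generate(text, stop) returns the model's continuation
# of `text`, halting before `stop` when a stop marker is given.
Generate = Callable[[str, Optional[str]], str]


def answer_with_injection(prompt: str, generate: Generate, phrase: str,
                          stop: str = "Final answer:") -> str:
    """Let the model reason, insert `phrase`, then let it finish."""
    # 1. Generate the reasoning up to (not including) the answer marker.
    reasoning = generate(prompt, stop)
    # 2. Append the injection phrase after the reasoning so far.
    steered = f"{prompt}\n{reasoning.rstrip()}\n{phrase}\n"
    # 3. Continue generation from the steered context.
    return steered + generate(steered, None)


def fake_generate(text: str, stop: Optional[str]) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    if stop is not None:
        return "Most people in this group do X, so this person does X."
    return "The question gives no evidence about this person. Final answer: unknown"


PLACEHOLDER_PHRASE = "Wait, am I being fair?"  # placeholder, not a discovered phrase
transcript = answer_with_injection(
    "Q: Does this person do X?", fake_generate, PLACEHOLDER_PHRASE
)
```

The design choice worth noting is that the model's weights and the user's prompt are untouched; only the partially generated context changes before the answer is produced.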

If this is right

Performance rises on multiple fairness benchmarks when the discovered phrases are injected.
The same phrases found on smaller models remain effective when transferred to larger models.
Reasoning-level fairness metrics improve, not only surface-level output metrics.
Bias decreases in open-ended generation tasks that were not used during phrase search.
The phrases transfer to real-world tasks that involve fairness-sensitive decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the injection mechanism works by altering the model’s internal attention to statistical cues, similar short phrases might be found for other documented reasoning failures such as over-reliance on spurious correlations.
The framework assumes the model already possesses the knowledge needed for fair reasoning; it may therefore be less effective on tasks where the underlying facts themselves are contested or absent from training data.
Because the phrases are discovered automatically, the method could be applied to new fairness definitions or new languages without hand-crafted rules.

Load-bearing premise

Inserting a short phrase at reasoning time can reliably shift an LLM toward fairness-aware inference without lowering accuracy on unrelated tasks or creating new unintended biases.

What would settle it

A controlled test in which the same models are run on the original fairness benchmarks both with and without the discovered phrases, measuring whether the reported gains disappear or reverse when the phrases are withheld.
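
One standard way to analyze such a with/without comparison, when every benchmark item is scored as fair or unfair under both conditions, is an exact McNemar test on the paired outcomes. The sketch below assumes binary per-item scores and invented data; the source does not say which test the authors use, so treat this as one reasonable option rather than the paper's protocol.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(without: Sequence[bool],
                  with_phrase: Sequence[bool]) -> Tuple[int, int, float]:
    """Two-sided exact McNemar test on paired per-item fairness outcomes.

    Returns (items fair only without the phrase,
             items fair only with the phrase,
             p-value).
    """
    if len(without) != len(with_phrase):
        raise ValueError("outcome lists must be paired item by item")
    # Only discordant pairs carry information about the intervention.
    b = sum(1 for w, p in zip(without, with_phrase) if w and not p)
    c = sum(1 for w, p in zip(without, with_phrase) if p and not w)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Invented outcomes for ten items (True = judged fair).
fair_without = [False] * 8 + [True, True]
fair_with = [True] * 8 + [True, False]
lost, gained, p_value = mcnemar_exact(fair_without, fair_with)
```

If the gains reported with the phrases are real, `gained` should clearly exceed `lost` and the p-value should be small; if withholding the phrases makes no difference, the discordant pairs should split roughly evenly.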

Figures

Figures reproduced from arXiv: 2606.30989 by Clayton Scott, Joan Nwatu, Naihao Deng, Rada Mihalcea, Yilun Zhu.

**Figure 2.** Examples of deductive stereotyping. The model (Llama 3.1 8B) introduces a generalized social prior, ranging from seemingly benign norms (left) and widely circulated narratives (middle) to explicitly harmful stereotypes (right), and deductively applies it as a premise to an individual case, yielding an unjustified and potentially harmful conclusion.

**Figure 3.** Without intervention (left), Llama 3.1 8B …

**Figure 4.** Fairness performance on validation set across varying numbers of training examples. The …

**Figure 5.** (a) Effect of injection position on average fairness. E denotes injection at the end of …

**Figure 6.** Distribution of hidden-state differences ( …

Original abstract

Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon. To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework. We further introduce Fair-GCG to systematically discover effective injection phrases. Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, improve reasoning-level fairness, reduce bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper names deductive stereotyping and gives Fair-GCG as a search method for fairness injection phrases, but supplies no data on capability preservation or new biases.


The paper defines deductive stereotyping as the pattern where models apply population statistics to individual cases in a logically coherent but biased way. It supplies a statistical interpretation and then introduces Fair-GCG to search for short phrases that can be injected during reasoning to push outputs toward fairness.
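
The text here does not describe how Fair-GCG works internally. Purely as intuition for what a coordinate-wise phrase search involves, the toy sketch below swaps one token position at a time and keeps a substitution only when a score improves. It is gradient-free (GCG-style methods typically use gradients to shortlist candidate substitutions), and `toy_score`, `TARGET`, and the vocabulary are hypothetical stand-ins for a real fairness metric and tokenizer. It is not the authors' algorithm.

```python
from typing import Callable, List, Sequence


def greedy_phrase_search(vocab: Sequence[str], length: int,
                         score: Callable[[List[str]], float],
                         sweeps: int = 2) -> List[str]:
    """Greedy coordinate search over a fixed-length token phrase."""
    phrase = [vocab[0]] * length
    best = score(phrase)
    for _ in range(sweeps):
        for pos in range(length):
            for token in vocab:
                # Try replacing a single position, leaving the rest fixed.
                candidate = phrase[:pos] + [token] + phrase[pos + 1:]
                candidate_score = score(candidate)
                if candidate_score > best:
                    phrase, best = candidate, candidate_score
    return phrase


# Toy objective: count positions that match a hidden target phrase.
# A real search would instead score fairness on held-out benchmark items.
TARGET = ["wait", "am", "I", "fair"]


def toy_score(phrase: List[str]) -> float:
    return float(sum(a == b for a, b in zip(phrase, TARGET)))


found = greedy_phrase_search(["the", "wait", "am", "I", "fair"], 4, toy_score)
```

With a separable toy objective like this one the search recovers the target in a single sweep; real fairness scores are noisy and non-separable, which is part of why the choice of search procedure matters.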

The part that holds up is the practical framing. The abstract states that the phrases raise scores on multiple fairness benchmarks, scale from smaller to larger models, improve reasoning-level fairness, reduce bias in open-ended generation, and transfer to some real-world tasks. Treating the issue as a searchable prompt problem rather than a training fix is a direct move that fits existing inference-time work.

The clear gap is the missing controls on side effects. The description gives no before-and-after numbers on standard capability benchmarks and no measurements of whether the phrases create fresh biases on unrelated dimensions. Without those, the claim that the method steers fairness without degrading performance or shifting the problem rests on an untested assumption. The stress-test note correctly flags this.

This is for people working on LLM fairness and prompt-based mitigation. A reader in that niche can take the failure-mode label and the search procedure as usable pieces. It deserves a serious referee because the problem is concrete and the method is straightforward, even if the current evidence is limited to the fairness side.

Send it to peer review with the expectation that referees will require capability checks and broader bias tests.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a failure mode termed 'deductive stereotyping' in which LLMs apply population-level statistical regularities to individual cases, yielding logically coherent but socially biased inferences. It provides a statistical interpretation of the phenomenon and proposes a reasoning-time phrase injection framework, along with Fair-GCG, to discover effective injection phrases that steer models toward fairness-aware reasoning. The abstract asserts that phrases found by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, enhance reasoning-level fairness, reduce bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.

Significance. If the empirical results hold with appropriate controls, the work would supply a practical, training-free intervention for a specific bias mode in LLMs and introduce a systematic discovery procedure (Fair-GCG) that could be reusable. The framing of deductive stereotyping as a distinct, statistically interpretable failure mode adds conceptual clarity to the bias literature. However, the significance is limited by the absence of evidence that fairness gains preserve general capabilities or avoid new failure modes.

major comments (2)

[Abstract and experimental evaluation sections] The central claim that reasoning-time injection improves fairness metrics while leaving general capabilities and other bias dimensions intact lacks any reported results on standard capability benchmarks (MMLU, GSM8K, HumanEval) or unrelated bias axes before versus after injection. This is load-bearing for the 'steer without degrading' guarantee of the proposed framework.
[Fair-GCG method section] The optimization objective used to discover injection phrases is not shown to incorporate any term that penalizes degradation on non-fairness tasks; without such a term or post-hoc verification, the reported fairness gains could be traded against capability loss.
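
To make the second comment concrete, a selection rule with an explicit capability penalty might look like the sketch below. This is a hypothetical scoring rule, not the Fair-GCG objective; the weight `lam`, the phrase names, and all benchmark numbers are invented for illustration.

```python
def penalized_score(fairness_gain: float, capability_before: float,
                    capability_after: float, lam: float = 2.0) -> float:
    """Fairness gain minus a penalty for any drop in capability accuracy."""
    # Only degradation is penalized; capability improvements earn no bonus.
    capability_drop = max(0.0, capability_before - capability_after)
    return fairness_gain - lam * capability_drop


# Invented candidates: phrase_b gains more fairness but costs more capability.
candidates = {
    "phrase_a": penalized_score(0.10, capability_before=0.70, capability_after=0.69),
    "phrase_b": penalized_score(0.15, capability_before=0.70, capability_after=0.62),
}
best = max(candidates, key=candidates.get)
```

Under this rule the phrase with the larger raw fairness gain loses once its capability cost is counted, which is the trade-off the referee asks the authors to rule out.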

minor comments (1)

[Abstract] The claim about 'reasoning-level fairness' does not name the specific metric or baseline comparison used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in our current evaluation that must be addressed to support the framework's claims. We respond to each major comment below and will revise the manuscript accordingly.


Referee: [Abstract and experimental evaluation sections] The central claim that reasoning-time injection improves fairness metrics while leaving general capabilities and other bias dimensions intact lacks any reported results on standard capability benchmarks (MMLU, GSM8K, HumanEval) or unrelated bias axes before versus after injection. This is load-bearing for the 'steer without degrading' guarantee of the proposed framework.

Authors: We agree that the manuscript currently lacks direct evidence on capability preservation and other bias axes, which is necessary to substantiate the steering claim. In the revised manuscript we will add before-and-after comparisons on MMLU, GSM8K, and HumanEval, along with checks on unrelated bias dimensions. Revision: yes.

Referee: [Fair-GCG method section] The optimization objective used to discover injection phrases is not shown to incorporate any term that penalizes degradation on non-fairness tasks; without such a term or post-hoc verification, the reported fairness gains could be traded against capability loss.

Authors: We acknowledge that the Fair-GCG objective contains no explicit penalty for non-fairness degradation. The revision will incorporate post-hoc verification on the capability benchmarks listed above to confirm that fairness gains do not trade off against general performance. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical method with independent claims


The paper introduces deductive stereotyping as a failure mode and proposes a reasoning-time injection framework plus Fair-GCG to discover mitigation phrases. All central claims are empirical (performance gains on fairness benchmarks, generalization across model sizes, transfer to open-ended and real-world tasks). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The statistical interpretation of the phenomenon is presented as an independent contribution rather than a tautology. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces the new concept of deductive stereotyping and the Fair-GCG method; the central claims rest on the domain assumption that phrase injection can produce fairness improvements. The abstract describes no free parameters, and it offers no independent evidence for either invented entity.

axioms (1)

Domain assumption: Reasoning-time injection of discovered phrases can steer LLMs toward fairness-aware reasoning.
This premise underpins the entire proposed framework and the reported transfer and generalization results.

invented entities (2)

deductive stereotyping (no independent evidence)
purpose: To name and characterize the identified failure mode of applying population statistics to individuals
New term defined in the paper; no independent evidence outside the described experiments is provided.

Fair-GCG (no independent evidence)
purpose: To systematically discover effective injection phrases for fairness
New method introduced; no independent evidence outside the described experiments is provided.



Reference graph

Works this paper leans on

106 extracted references · 53 canonical work pages · 21 internal anchors

[1]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[2]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[3]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[4]

On Second Thought, Let ' s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning

Shaikh, Omar and Zhang, Hongxin and Held, William and Bernstein, Michael and Yang, Diyi. On Second Thought, Let ' s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.244

work page doi:10.18653/v1/2023.acl-long.244 2023
[5]

Hi- T o M : A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models

Wu, Yufan and He, Yinghui and Jia, Yilin and Mihalcea, Rada and Chen, Yulong and Deng, Naihao. Hi- T o M : A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.717

work page doi:10.18653/v1/2023.findings-emnlp.717 2023
[6]

Qwen Technical Report

Qwen technical report , author=. arXiv preprint arXiv:2309.16609 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Qwen2.5-Coder Technical Report

Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=
[9]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[10]

and Xu, Yan and Fung, Pascale

Bang, Yejin and Cahyawijaya, Samuel and Lee, Nayeon and Dai, Wenliang and Su, Dan and Wilie, Bryan and Lovenia, Holy and Ji, Ziwei and Yu, Tiezheng and Chung, Willy and Do, Quyet V. and Xu, Yan and Fung, Pascale. A Multitask, Multilingual, Multimodal Evaluation of C hat GPT on Reasoning, Hallucination, and Interactivity. Proceedings of the 13th Internatio...

work page doi:10.18653/v1/2023.ijcnlp-main.45 2023
[11]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Advances in neural information processing systems , volume=

Fair clustering through fairlets , author=. Advances in neural information processing systems , volume=
[15]

Shahbazi, Nima and Lin, Yin and Asudeh, Abolfazl and Jagadish, H. V. , title =. ACM Comput. Surv. , month = jul, articleno =. 2023 , issue_date =. doi:10.1145/3588433 , abstract =

work page doi:10.1145/3588433 2023
[16]

Lin, Yin and Gupta, Samika and Jagadish, H. V. , booktitle=. Mitigating Subgroup Unfairness in Machine Learning Classifiers: A Data-Driven Approach , year=
[17]

Advances in neural information processing systems , volume=

On fairness and calibration , author=. Advances in neural information processing systems , volume=
[18]

Advances in neural information processing systems , volume=

Fairness in learning: Classic and contextual bandits , author=. Advances in neural information processing systems , volume=
[19]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[21]

In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Li, Junyi and Cheng, Xiaoxue and Zhao, Xin and Nie, Jian-Yun and Wen, Ji-Rong. H alu E val: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.397

work page doi:10.18653/v1/2023.emnlp-main.397 2023
[22]

How Likely Do LLM s with C o T Mimic Human Reasoning?

Bao, Guangsheng and Zhang, Hongbo and Wang, Cunxiang and Yang, Linyi and Zhang, Yue. How Likely Do LLM s with C o T Mimic Human Reasoning?. Proceedings of the 31st International Conference on Computational Linguistics. 2025

2025
[23]

Annual review of psychology , volume=

Deductive reasoning , author=. Annual review of psychology , volume=. 1999 , publisher=

1999
[24]

2009 , publisher=

Nudge: Improving decisions about health, wealth, and happiness , author=. 2009 , publisher=

2009
[25]

Scientific Reports , volume=

Pause before action: Waiting short time as a simple and resource-rational boost , author=. Scientific Reports , volume=. 2025 , publisher=

2025
[26]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

BBQ : A hand-built bias benchmark for question answering

Parrish, Alicia and Chen, Angelica and Nangia, Nikita and Padmakumar, Vishakh and Phang, Jason and Thompson, Jana and Htut, Phu Mon and Bowman, Samuel. BBQ : A hand-built bias benchmark for question answering. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.165

work page doi:10.18653/v1/2022.findings-acl.165 2022
[29]

C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

Nangia, Nikita and Vania, Clara and Bhalerao, Rasika and Bowman, Samuel R. C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.154

work page doi:10.18653/v1/2020.emnlp-main.154 2020
[30]

Evaluating Gender Bias of LLM s in Making Morality Judgements

Bajaj, Divij and Lei, Yuanyuan and Tong, Jonathan and Huang, Ruihong. Evaluating Gender Bias of LLM s in Making Morality Judgements. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.928

work page doi:10.18653/v1/2024.findings-emnlp.928 2024
[31]

S tereo S et: Measuring stereotypical bias in pretrained language models

Nadeem, Moin and Bethke, Anna and Reddy, Siva. S tereo S et: Measuring stereotypical bias in pretrained language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.416

work page doi:10.18653/v1/2021.acl-long.416 2021
[32]

W ino Q ueer: A Community-in-the-Loop Benchmark for Anti- LGBTQ + Bias in Large Language Models

Felkner, Virginia and Chang, Ho-Chun Herbert and Jang, Eugene and May, Jonathan. W ino Q ueer: A Community-in-the-Loop Benchmark for Anti- LGBTQ + Bias in Large Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.507

work page doi:10.18653/v1/2023.acl-long.507 2023
[33]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Mixtral of Experts

Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

PaLM 2 Technical Report

Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

The Falcon Series of Open Language Models

The falcon series of open language models , author=. arXiv preprint arXiv:2311.16867 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

2023 , eprint=

Mistral 7B , author=. 2023 , eprint=

2023
[38]

Qwen2 Technical Report

Qwen2 technical report , author=. arXiv preprint arXiv:2407.10671 , volume=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

and Lebrecht, Sophie and Choi, Yejin and Hajishirzi, Hannaneh and Farhadi, Ali and Dodge, Jesse

Liu, Jiacheng and Blanton, Taylor and Elazar, Yanai and Min, Sewon and Chen, Yen-Sung and Chheda-Kothary, Arnavi and Tran, Huy and Bischoff, Byron and Marsh, Eric and Schmitz, Michael and Trier, Cassidy and Sarnat, Aaron and James, Jenna and Borchardt, Jon and Kuehl, Bailey and Cheng, Evie Yu-Yen and Farley, Karen and Anderson, Taira and Albright, David a...

work page doi:10.18653/v1/2025.acl-demo.18 2025
[41]

Membership Inference Attacks against Language Models via Neighbourhood Comparison

Mattern, Justus and Mireshghallah, Fatemehsadat and Jin, Zhijing and Schoelkopf, Bernhard and Sachan, Mrinmaya and Berg-Kirkpatrick, Taylor. Membership Inference Attacks against Language Models via Neighbourhood Comparison. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.719

work page doi:10.18653/v1/2023.findings-acl.719 2023
[42]

Stereotyping N orwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets

Blodgett, Su Lin and Lopez, Gilsinia and Olteanu, Alexandra and Sim, Robert and Wallach, Hanna. Stereotyping N orwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1...

work page doi:10.18653/v1/2021.acl-long.81 2021
[43]

Aristotle’s logic , author=
[44]

, author=

Stereotypes and prejudice: Their automatic and controlled components. , author=. Journal of personality and social psychology , volume=. 1989 , publisher=

1989
[45]

2001 , publisher=

Justice as fairness: A restatement , author=. 2001 , publisher=

2001
[46]

1996 , publisher=

Practical philosophy , author=. 1996 , publisher=

1996
[47]

The Is-Ought Question: A Collection of Papers on the Central Problem in Moral Philosophy , pages=

Hume on ‘is’ and ‘ought’ , author=. The Is-Ought Question: A Collection of Papers on the Central Problem in Moral Philosophy , pages=. 1969 , publisher=

1969
[48]

Computational Linguistics , pages=

Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models , author=. Computational Linguistics , pages=. 2025 , publisher=

2025
[49]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Towards Mitigating

Ji, Ziwei and Yu, Tiezheng and Xu, Yan and Lee, Nayeon and Ishii, Etsuko and Fung, Pascale. Towards Mitigating LLM Hallucination via Self Reflection. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.123

work page doi:10.18653/v1/2023.findings-emnlp.123 2023
[52]

T as T e: Teaching Large Language Models to Translate through Self-Reflection

Wang, Yutong and Zeng, Jiali and Liu, Xuebo and Meng, Fandong and Zhou, Jie and Zhang, Min. T as T e: Teaching Large Language Models to Translate through Self-Reflection. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.333

work page doi:10.18653/v1/2024.acl-long.333 2024
[53]

H ot F lip: White-Box Adversarial Examples for Text Classification

Ebrahimi, Javid and Rao, Anyi and Lowd, Daniel and Dou, Dejing. H ot F lip: White-Box Adversarial Examples for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018. doi:10.18653/v1/P18-2006

work page doi:10.18653/v1/p18-2006 2018
[54]

and Wallace, Eric and Singh, Sameer

Shin, Taylor and Razeghi, Yasaman and Logan IV, Robert L. and Wallace, Eric and Singh, Sameer. A uto P rompt: E liciting K nowledge from L anguage M odels with A utomatically G enerated P rompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.346

work page doi:10.18653/v1/2020.emnlp-main.346 2020
[55]

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

Cheng, Jiale and Liu, Xiao and Zheng, Kehan and Ke, Pei and Wang, Hongning and Dong, Yuxiao and Tang, Jie and Huang, Minlie. Black-Box Prompt Optimization: Aligning Large Language Models without Model Training. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.176

work page doi:10.18653/v1/2024.acl-long.176 2024
[56]

Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Singla, Somanshu and Wang, Zhen and Liu, Tianyang and Ashfaq, Abdullah and Hu, Zhiting and Xing, Eric P. Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1220

work page doi:10.18653/v1/2024.emnlp-main.1220 2024
[57]

ACM computing surveys , volume=

Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing , author=. ACM computing surveys , volume=. 2023 , publisher=

2023
[58]

arXiv preprint arXiv:2312.12321 , year=

Bypassing the safety training of open-source llms with priming attacks , author=. arXiv preprint arXiv:2312.12321 , year=

work page arXiv
[59]

Jailbreaking Leading Safety-Aligned

Maksym Andriushchenko and Francesco Croce and Nicolas Flammarion , booktitle=. Jailbreaking Leading Safety-Aligned. 2025 , url=

2025
[60]

Psychometrika , volume=

Note on the sampling error of the difference between correlated proportions or percentages , author=. Psychometrika , volume=. 1947 , publisher=

1947
[61]

proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

Bias in bios: A case study of semantic representation bias in a high-stakes setting , author=. proceedings of the Conference on Fairness, Accountability, and Transparency , pages=
[62]

Zara Hall and Melanie Subbiah and Thomas P Zollo and Kathleen McKeown and Richard Zemel , booktitle=. Guiding. 2025 , url=

2025
[63]

2016 , publisher=

Big data: A report on algorithmic systems, opportunity, and civil rights , author=. 2016 , publisher=

2016
[64]

International conference on machine learning , pages=

Fairness in reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[65]

Quantifying and Reducing Stereotypes in Word Embeddings

Quantifying and reducing stereotypes in word embeddings , author=. arXiv preprint arXiv:1606.06121 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[66]

Advances in neural information processing systems , volume=

Man is to computer programmer as woman is to homemaker? debiasing word embeddings , author=. Advances in neural information processing systems , volume=
[67]

Kelly is a Warm Person, Joseph is a Role Model

“Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

2023
[68]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Bias unveiled: Investigating social bias in LLM-Generated Code , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[69]

International Conference on Learning Representations , year=

JUSTICE OR PREJUDICE? QUANTIFYING BIASES IN LLM-AS-A-JUDGE , author=. International Conference on Learning Representations , year=
[70]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. The Eleventh International Conference on Learning Representations.
[71]

Decomposed Prompting: A Modular Approach for Solving Complex Tasks. The Eleventh International Conference on Learning Representations.
[72]

Automatic chain of thought prompting in large language models. The Eleventh International Conference on Learning Representations.
[73]

Boosting language models reasoning with chain-of-knowledge prompting. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[74]

Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation. arXiv preprint arXiv:2510.17062.
[75]

Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. Proceedings of the Computer Vision and Pattern Recognition Conference.
[76]

RAIN: Your Language Models Can Align Themselves without Finetuning. The Twelfth International Conference on Learning Representations.
[77]

Unleashing the potential of prompt engineering in large language models: a comprehensive review. arXiv preprint arXiv:2310.14735.
[78]

Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback. Forty-second International Conference on Machine Learning.
[79]

Multitask instruction-based prompting for fallacy recognition. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
[80]

Self-reflection in LLM agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682.

Showing first 80 references.
