pith. sign in

arxiv: 2606.30989 · v1 · pith:JXQS6IUInew · submitted 2026-06-30 · 💻 cs.CL · cs.AI

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

Pith reviewed 2026-07-01 01:10 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords deductive stereotypingfairness in LLMsreasoning-time injectionFair-GCGbias mitigationchain-of-thought fairnesspopulation-level inference
0
0 comments X

The pith

LLMs apply population statistics to single cases in a failure mode called deductive stereotyping, and phrases found by Fair-GCG steer them toward fairness-aware reasoning at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies deductive stereotyping as the pattern in which large language models draw logically coherent but socially biased conclusions by treating group-level regularities as true for every individual. It supplies a statistical account of why this pattern survives even when chain-of-thought reasoning is used. To counteract it, the authors introduce a reasoning-time injection framework that inserts short learned phrases before the model’s final answer. Fair-GCG, a search procedure, systematically locates phrases that raise scores on multiple fairness benchmarks while preserving performance on ordinary tasks. The same phrases transfer from smaller to larger models, reduce bias in open-ended text, and improve outcomes on real-world fairness-sensitive decisions.

Core claim

Deductive stereotyping is the tendency of LLMs to substitute population-level statistical associations for case-specific reasoning, yielding inferences that are internally consistent yet socially biased. A reasoning-time injection framework counters this by prepending short phrases discovered through Fair-GCG; these phrases measurably raise fairness metrics, generalize across model scales, improve reasoning-level fairness, lower bias in free-form generation, and carry over to downstream fairness-sensitive applications.

What carries the argument

The reasoning-time injection framework, which prepends short learned phrases before the model produces its final answer, together with Fair-GCG, the search method that identifies the phrases.

If this is right

  • Performance rises on multiple fairness benchmarks when the discovered phrases are injected.
  • The same phrases found on smaller models remain effective when transferred to larger models.
  • Reasoning-level fairness metrics improve, not only surface-level output metrics.
  • Bias decreases in open-ended generation tasks that were not used during phrase search.
  • The phrases transfer to real-world tasks that involve fairness-sensitive decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the injection mechanism works by altering the model’s internal attention to statistical cues, similar short phrases might be found for other documented reasoning failures such as over-reliance on spurious correlations.
  • The framework assumes the model already possesses the knowledge needed for fair reasoning; it may therefore be less effective on tasks where the underlying facts themselves are contested or absent from training data.
  • Because the phrases are discovered automatically, the method could be applied to new fairness definitions or new languages without hand-crafted rules.

Load-bearing premise

Inserting a short phrase at reasoning time can reliably shift an LLM toward fairness-aware inference without lowering accuracy on unrelated tasks or creating new unintended biases.

What would settle it

A controlled test in which the same models are run on the original fairness benchmarks both with and without the discovered phrases, measuring whether the reported gains disappear or reverse when the phrases are withheld.

Figures

Figures reproduced from arXiv: 2606.30989 by Clayton Scott, Joan Nwatu, Naihao Deng, Rada Mihalcea, Yilun Zhu.

Figure 1
Figure 1. Figure 1: Without intervention (left), Qwen 2.5 7B [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Examples of deductive stereotyping. The model (Llama 3.1 8B) introduces a generalized social prior, ranging from seemingly benign norms (left), widely circulated narratives (middle), to explicitly harmful stereotypes (right), and deductively applies it as a premise to an individual case, yielding an unjustified and potentially harmful conclusion. 3 Deductive Stereotyping: A Failure Mode in Fairness Reasoni… view at source ↗
Figure 3
Figure 3. Figure 3: Without intervention (left), Llama 3.1 8B [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Fairness performance on validation set across varying numbers of training examples. The [PITH_FULL_IMAGE:figures/full_fig_p032_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) Effect of injection position on average fairness. E denotes injection at the end of [PITH_FULL_IMAGE:figures/full_fig_p037_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of hidden-state differences ( [PITH_FULL_IMAGE:figures/full_fig_p044_6.png] view at source ↗
read the original abstract

Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon. To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework. We further introduce Fair-GCG to systematically discover effective injection phrases. Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, improves reasoning-level fairness, reduces bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a failure mode termed 'deductive stereotyping' in which LLMs apply population-level statistical regularities to individual cases, yielding logically coherent but socially biased inferences. It provides a statistical interpretation of the phenomenon and proposes a reasoning-time phrase injection framework, along with Fair-GCG, to discover effective injection phrases that steer models toward fairness-aware reasoning. The abstract asserts that phrases found by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, enhance reasoning-level fairness, reduce bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.

Significance. If the empirical results hold with appropriate controls, the work would supply a practical, training-free intervention for a specific bias mode in LLMs and introduce a systematic discovery procedure (Fair-GCG) that could be reusable. The framing of deductive stereotyping as a distinct, statistically interpretable failure mode adds conceptual clarity to the bias literature. However, the significance is limited by the absence of evidence that fairness gains preserve general capabilities or avoid new failure modes.

major comments (2)
  1. [Abstract and experimental evaluation sections] Abstract and experimental evaluation sections: the central claim that reasoning-time injection improves fairness metrics while leaving general capabilities and other bias dimensions intact lacks any reported results on standard capability benchmarks (MMLU, GSM8K, HumanEval) or unrelated bias axes before versus after injection. This is load-bearing for the 'steer without degrading' guarantee of the proposed framework.
  2. [Fair-GCG method section] Fair-GCG method section: the optimization objective used to discover injection phrases is not shown to incorporate any term that penalizes degradation on non-fairness tasks; without such a term or post-hoc verification, the reported fairness gains could be traded against capability loss.
minor comments (1)
  1. [Abstract] Abstract: the sentence 'improves reasoning-level fairness' does not name the specific metric or baseline comparison used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in our current evaluation that must be addressed to support the framework's claims. We respond to each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and experimental evaluation sections] Abstract and experimental evaluation sections: the central claim that reasoning-time injection improves fairness metrics while leaving general capabilities and other bias dimensions intact lacks any reported results on standard capability benchmarks (MMLU, GSM8K, HumanEval) or unrelated bias axes before versus after injection. This is load-bearing for the 'steer without degrading' guarantee of the proposed framework.

    Authors: We agree that the manuscript currently lacks direct evidence on capability preservation and other bias axes, which is necessary to substantiate the steering claim. In the revised manuscript we will add before-and-after comparisons on MMLU, GSM8K, and HumanEval, along with checks on unrelated bias dimensions. revision: yes

  2. Referee: [Fair-GCG method section] Fair-GCG method section: the optimization objective used to discover injection phrases is not shown to incorporate any term that penalizes degradation on non-fairness tasks; without such a term or post-hoc verification, the reported fairness gains could be traded against capability loss.

    Authors: We acknowledge that the Fair-GCG objective contains no explicit penalty for non-fairness degradation. The revision will incorporate post-hoc verification on the capability benchmarks listed above to confirm that fairness gains do not trade off against general performance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent claims

full rationale

The paper introduces deductive stereotyping as a failure mode and proposes a reasoning-time injection framework plus Fair-GCG to discover mitigation phrases. All central claims are empirical (performance gains on fairness benchmarks, generalization across model sizes, transfer to open-ended and real-world tasks). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The statistical interpretation of the phenomenon is presented as an independent contribution rather than a tautology. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper introduces the new concept of deductive stereotyping and the Fair-GCG method; the central claims rest on the domain assumption that phrase injection can produce fairness improvements. No free parameters or invented entities with independent evidence are described in the abstract.

axioms (1)
  • domain assumption Reasoning-time injection of discovered phrases can steer LLMs toward fairness-aware reasoning
    This premise underpins the entire proposed framework and the reported transfer and generalization results.
invented entities (2)
  • deductive stereotyping no independent evidence
    purpose: To name and characterize the identified failure mode of applying population statistics to individuals
    New term defined in the paper; no independent evidence outside the described experiments is provided.
  • Fair-GCG no independent evidence
    purpose: To systematically discover effective injection phrases for fairness
    New method introduced; no independent evidence outside the described experiments is provided.

pith-pipeline@v0.9.1-grok · 5665 in / 1405 out tokens · 38616 ms · 2026-07-01T01:10:35.499610+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

106 extracted references · 53 canonical work pages · 21 internal anchors

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [4]

    On Second Thought, Let ' s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning

    Shaikh, Omar and Zhang, Hongxin and Held, William and Bernstein, Michael and Yang, Diyi. On Second Thought, Let ' s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.244

  5. [5]

    Hi- T o M : A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models

    Wu, Yufan and He, Yinghui and Jia, Yilin and Mihalcea, Rada and Chen, Yulong and Deng, Naihao. Hi- T o M : A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.717

  6. [6]

    Qwen Technical Report

    Qwen technical report , author=. arXiv preprint arXiv:2309.16609 , year=

  7. [7]

    Qwen2.5-Coder Technical Report

    Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

  8. [8]

    arXiv e-prints , pages=

    The llama 3 herd of models , author=. arXiv e-prints , pages=

  9. [9]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  10. [10]

    and Xu, Yan and Fung, Pascale

    Bang, Yejin and Cahyawijaya, Samuel and Lee, Nayeon and Dai, Wenliang and Su, Dan and Wilie, Bryan and Lovenia, Holy and Ji, Ziwei and Yu, Tiezheng and Chung, Willy and Do, Quyet V. and Xu, Yan and Fung, Pascale. A Multitask, Multilingual, Multimodal Evaluation of C hat GPT on Reasoning, Hallucination, and Interactivity. Proceedings of the 13th Internatio...

  11. [11]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sparks of artificial general intelligence: Early experiments with gpt-4 , author=. arXiv preprint arXiv:2303.12712 , year=

  12. [12]

    DeepSeek-V3 Technical Report

    Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

  13. [13]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

  14. [14]

    Advances in neural information processing systems , volume=

    Fair clustering through fairlets , author=. Advances in neural information processing systems , volume=

  15. [15]

    Shahbazi, Nima and Lin, Yin and Asudeh, Abolfazl and Jagadish, H. V. , title =. ACM Comput. Surv. , month = jul, articleno =. 2023 , issue_date =. doi:10.1145/3588433 , abstract =

  16. [16]

    Lin, Yin and Gupta, Samika and Jagadish, H. V. , booktitle=. Mitigating Subgroup Unfairness in Machine Learning Classifiers: A Data-Driven Approach , year=

  17. [17]

    Advances in neural information processing systems , volume=

    On fairness and calibration , author=. Advances in neural information processing systems , volume=

  18. [18]

    Advances in neural information processing systems , volume=

    Fairness in learning: Classic and contextual bandits , author=. Advances in neural information processing systems , volume=

  19. [19]

    gpt-oss-120b & gpt-oss-20b Model Card

    gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=

  20. [20]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  21. [21]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Li, Junyi and Cheng, Xiaoxue and Zhao, Xin and Nie, Jian-Yun and Wen, Ji-Rong. H alu E val: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.397

  22. [22]

    How Likely Do LLM s with C o T Mimic Human Reasoning?

    Bao, Guangsheng and Zhang, Hongbo and Wang, Cunxiang and Yang, Linyi and Zhang, Yue. How Likely Do LLM s with C o T Mimic Human Reasoning?. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  23. [23]

    Annual review of psychology , volume=

    Deductive reasoning , author=. Annual review of psychology , volume=. 1999 , publisher=

  24. [24]

    2009 , publisher=

    Nudge: Improving decisions about health, wealth, and happiness , author=. 2009 , publisher=

  25. [25]

    Scientific Reports , volume=

    Pause before action: Waiting short time as a simple and resource-rational boost , author=. Scientific Reports , volume=. 2025 , publisher=

  26. [26]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

  27. [27]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

  28. [28]

    BBQ : A hand-built bias benchmark for question answering

    Parrish, Alicia and Chen, Angelica and Nangia, Nikita and Padmakumar, Vishakh and Phang, Jason and Thompson, Jana and Htut, Phu Mon and Bowman, Samuel. BBQ : A hand-built bias benchmark for question answering. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.165

  29. [29]

    C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

    Nangia, Nikita and Vania, Clara and Bhalerao, Rasika and Bowman, Samuel R. C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.154

  30. [30]

    Evaluating Gender Bias of LLM s in Making Morality Judgements

    Bajaj, Divij and Lei, Yuanyuan and Tong, Jonathan and Huang, Ruihong. Evaluating Gender Bias of LLM s in Making Morality Judgements. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.928

  31. [31]

    S tereo S et: Measuring stereotypical bias in pretrained language models

    Nadeem, Moin and Bethke, Anna and Reddy, Siva. S tereo S et: Measuring stereotypical bias in pretrained language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.416

  32. [32]

    W ino Q ueer: A Community-in-the-Loop Benchmark for Anti- LGBTQ + Bias in Large Language Models

    Felkner, Virginia and Chang, Ho-Chun Herbert and Jang, Eugene and May, Jonathan. W ino Q ueer: A Community-in-the-Loop Benchmark for Anti- LGBTQ + Bias in Large Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.507

  33. [33]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

  34. [34]

    Mixtral of Experts

    Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

  35. [35]

    PaLM 2 Technical Report

    Palm 2 technical report , author=. arXiv preprint arXiv:2305.10403 , year=

  36. [36]

    The Falcon Series of Open Language Models

    The falcon series of open language models , author=. arXiv preprint arXiv:2311.16867 , year=

  37. [37]

    2023 , eprint=

    Mistral 7B , author=. 2023 , eprint=

  38. [38]

    Qwen2 Technical Report

    Qwen2 technical report , author=. arXiv preprint arXiv:2407.10671 , volume=

  39. [39]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  40. [40]

    and Lebrecht, Sophie and Choi, Yejin and Hajishirzi, Hannaneh and Farhadi, Ali and Dodge, Jesse

    Liu, Jiacheng and Blanton, Taylor and Elazar, Yanai and Min, Sewon and Chen, Yen-Sung and Chheda-Kothary, Arnavi and Tran, Huy and Bischoff, Byron and Marsh, Eric and Schmitz, Michael and Trier, Cassidy and Sarnat, Aaron and James, Jenna and Borchardt, Jon and Kuehl, Bailey and Cheng, Evie Yu-Yen and Farley, Karen and Anderson, Taira and Albright, David a...

  41. [41]

    Membership Inference Attacks against Language Models via Neighbourhood Comparison

    Mattern, Justus and Mireshghallah, Fatemehsadat and Jin, Zhijing and Schoelkopf, Bernhard and Sachan, Mrinmaya and Berg-Kirkpatrick, Taylor. Membership Inference Attacks against Language Models via Neighbourhood Comparison. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.719

  42. [42]

    Stereotyping N orwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets

    Blodgett, Su Lin and Lopez, Gilsinia and Olteanu, Alexandra and Sim, Robert and Wallach, Hanna. Stereotyping N orwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1...

  43. [43]

    Aristotle’s logic , author=

  44. [44]

    , author=

    Stereotypes and prejudice: Their automatic and controlled components. , author=. Journal of personality and social psychology , volume=. 1989 , publisher=

  45. [45]

    2001 , publisher=

    Justice as fairness: A restatement , author=. 2001 , publisher=

  46. [46]

    1996 , publisher=

    Practical philosophy , author=. 1996 , publisher=

  47. [47]

    The Is-Ought Question: A Collection of Papers on the Central Problem in Moral Philosophy , pages=

    Hume on ‘is’ and ‘ought’ , author=. The Is-Ought Question: A Collection of Papers on the Central Problem in Moral Philosophy , pages=. 1969 , publisher=

  48. [48]

    Computational Linguistics , pages=

    Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models , author=. Computational Linguistics , pages=. 2025 , publisher=

  49. [49]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

  50. [50]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Universal and transferable adversarial attacks on aligned language models , author=. arXiv preprint arXiv:2307.15043 , year=

  51. [51]

    Towards Mitigating

    Ji, Ziwei and Yu, Tiezheng and Xu, Yan and Lee, Nayeon and Ishii, Etsuko and Fung, Pascale. Towards Mitigating LLM Hallucination via Self Reflection. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.123

  52. [52]

    T as T e: Teaching Large Language Models to Translate through Self-Reflection

    Wang, Yutong and Zeng, Jiali and Liu, Xuebo and Meng, Fandong and Zhou, Jie and Zhang, Min. T as T e: Teaching Large Language Models to Translate through Self-Reflection. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.333

  53. [53]

    H ot F lip: White-Box Adversarial Examples for Text Classification

    Ebrahimi, Javid and Rao, Anyi and Lowd, Daniel and Dou, Dejing. H ot F lip: White-Box Adversarial Examples for Text Classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018. doi:10.18653/v1/P18-2006

  54. [54]

    and Wallace, Eric and Singh, Sameer

    Shin, Taylor and Razeghi, Yasaman and Logan IV, Robert L. and Wallace, Eric and Singh, Sameer. A uto P rompt: E liciting K nowledge from L anguage M odels with A utomatically G enerated P rompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.346

  55. [55]

    Black-Box Prompt Optimization: Aligning Large Language Models without Model Training

    Cheng, Jiale and Liu, Xiao and Zheng, Kehan and Ke, Pei and Wang, Hongning and Dong, Yuxiao and Tang, Jie and Huang, Minlie. Black-Box Prompt Optimization: Aligning Large Language Models without Model Training. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.176

  56. [56]

    Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

    Singla, Somanshu and Wang, Zhen and Liu, Tianyang and Ashfaq, Abdullah and Hu, Zhiting and Xing, Eric P. Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1220

  57. [57]

    ACM computing surveys , volume=

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing , author=. ACM computing surveys , volume=. 2023 , publisher=

  58. [58]

    arXiv preprint arXiv:2312.12321 , year=

    Bypassing the safety training of open-source llms with priming attacks , author=. arXiv preprint arXiv:2312.12321 , year=

  59. [59]

    Jailbreaking Leading Safety-Aligned

    Maksym Andriushchenko and Francesco Croce and Nicolas Flammarion , booktitle=. Jailbreaking Leading Safety-Aligned. 2025 , url=

  60. [60]

    Psychometrika , volume=

    Note on the sampling error of the difference between correlated proportions or percentages , author=. Psychometrika , volume=. 1947 , publisher=

  61. [61]

    proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

    Bias in bios: A case study of semantic representation bias in a high-stakes setting , author=. proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

  62. [62]

    Zara Hall and Melanie Subbiah and Thomas P Zollo and Kathleen McKeown and Richard Zemel , booktitle=. Guiding. 2025 , url=

  63. [63]

    2016 , publisher=

    Big data: A report on algorithmic systems, opportunity, and civil rights , author=. 2016 , publisher=

  64. [64]

    International conference on machine learning , pages=

    Fairness in reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

  65. [65]

    Quantifying and Reducing Stereotypes in Word Embeddings

    Quantifying and reducing stereotypes in word embeddings , author=. arXiv preprint arXiv:1606.06121 , year=

  66. [66]

    Advances in neural information processing systems , volume=

    Man is to computer programmer as woman is to homemaker? debiasing word embeddings , author=. Advances in neural information processing systems , volume=

  67. [67]

    Kelly is a Warm Person, Joseph is a Role Model

    “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  68. [68]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Bias unveiled: Investigating social bias in LLM-Generated Code , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  69. [69]

    International Conference on Learning Representations , year=

    JUSTICE OR PREJUDICE? QUANTIFYING BIASES IN LLM-AS-A-JUDGE , author=. International Conference on Learning Representations , year=

  70. [70]

    The Eleventh International Conference on Learning Representations , year=

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models , author=. The Eleventh International Conference on Learning Representations , year=

  71. [71]

    The Eleventh International Conference on Learning Representations , year=

    Decomposed Prompting: A Modular Approach for Solving Complex Tasks , author=. The Eleventh International Conference on Learning Representations , year=

  72. [72]

    The eleventh international conference on learning representations , year=

    Automatic chain of thought prompting in large language models , author=. The eleventh international conference on learning representations , year=

  73. [73]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Boosting language models reasoning with chain-of-knowledge prompting , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  74. [74]

    Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

    Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation , author=. arXiv preprint arXiv:2510.17062 , year=

  75. [75]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Steering away from harm: An adaptive approach to defending vision language model against jailbreaks , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  76. [76]

    The Twelfth International Conference on Learning Representations , year=

    RAIN: Your Language Models Can Align Themselves without Finetuning , author=. The Twelfth International Conference on Learning Representations , year=

  77. [77]

    arXiv preprint arXiv:2310.14735 , year=

    Unleashing the potential of prompt engineering in large language models: a comprehensive review , author=. arXiv preprint arXiv:2310.14735 , year=

  78. [78]

    Forty-second International Conference on Machine Learning , year=

    Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback , author=. Forty-second International Conference on Machine Learning , year=

  79. [79]

    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

    Multitask instruction-based prompting for fallacy recognition , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

  80. [80]

    arXiv preprint arXiv:2405.06682 , year=

    Self-reflection in llm agents: Effects on problem-solving performance , author=. arXiv preprint arXiv:2405.06682 , year=

Showing first 80 references.