
arxiv: 2502.19463 · v2 · submitted 2025-02-26 · 💻 cs.CY · cs.AI · cs.SI

Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

Pith reviewed 2026-05-23 02:22 UTC · model grok-4.3

keywords LLM alignment · human rights · hedging behavior · non-affirmation · group identity · debiasing · steering vectors

The pith

Group identity is the strongest predictor of hedging and non-affirmation in LLM answers to human rights questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to quantify how often LLMs hedge or decline to affirm human rights statements when the query names a specific national or stateless ethnic identity. Across 4738 prompts, 205 identities, and seven models, four models showed response patterns that depended significantly on the named group, and identity had a larger effect than conflict signals, statelessness, or national GDP. The disparity persisted when the prompts were rephrased. The authors then test mitigation on open-weight models and find that steering vectors tied to group identities reduce the unwanted behaviors more effectively than the other approaches tested, across query types and without downstream forgetting. A sympathetic reader would care because human rights are presented as universal, yet the models differ by group in how clearly they endorse them.

Core claim

Four of the seven LLMs exhibit hedging and non-affirmation that is statistically dependent on the queried identity, with identity producing larger effect sizes than conflict, sovereignty, or economic indicators. Group steering applied to open-weight models reduces these rates across query types and does not produce downstream forgetting, making it the strongest mitigation method identified.

What carries the argument

The measurement framework that classifies unconstrained LLM responses as hedging or non-affirmation across 205 identities, paired with group-identity steering vectors for debiasing.
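This page does not spell out how the steering vectors are constructed. As a rough illustration only, the sketch below uses a common construction (the difference of mean activations between two sets of examples) and shows both steering (adding the direction) and orthogonalization (projecting the direction out). The function names, the toy three-dimensional activations, and the scaling factor `alpha` are all assumptions made for this example, not details taken from the paper.

```python
# Minimal sketch of difference-of-means steering and orthogonalization.
# All names and numbers are hypothetical; the paper's actual procedure
# may differ.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(target_acts, baseline_acts):
    """Direction pointing from the baseline mean to the target mean."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]

def steer(hidden, direction, alpha):
    """Steering: add a scaled copy of the direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def orthogonalize(hidden, direction):
    """Orthogonalization: remove the component of hidden along direction."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(hidden)
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Toy activations (assumed): responses that affirm vs. responses that hedge.
affirming = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
hedging = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(affirming, hedging)
hidden = [3.0, 1.0, 1.0]
steered = steer(hidden, direction, alpha=0.5)
projected = orthogonalize(hidden, direction)

print(direction)  # [2.0, -2.0, 0.0]
print(steered)    # [4.0, 0.0, 1.0]
print(projected)  # [2.0, 2.0, 1.0]
```

After orthogonalization the hidden state has zero dot product with the direction, which is the property that distinguishes it from additive steering.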

If this is right

  • Identity produces larger disparities in affirmation than conflict signals or economic status.
  • Four of the seven tested models display measurable identity-linked differences in human-rights responses.
  • Group steering reduces hedging rates more effectively than other tested debiasing methods.
  • The observed patterns remain stable when prompts are rephrased.
  • Steering vectors do not cause measurable forgetting on unrelated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar identity-linked patterns may appear in other domains that require uniform ethical judgments.
  • Alignment methods could be extended to neutralize group-specific activations rather than only prompt-level fixes.
  • Downstream applications such as legal or policy drafting may inherit these disparities unless steering is applied.

Load-bearing premise

The classification rules used to label responses as hedging or non-affirmation accurately capture model intent and are not artifacts of prompt wording or evaluator bias.

What would settle it

A re-evaluation of the same prompts with an independent classification method or with human raters who are blind to the identity labels finds no significant dependence on group identity.
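One way such a re-evaluation could be scored is a chi-square test of independence between identity group and the assigned label. The sketch below computes the statistic by hand; the counts are invented for illustration, and the page does not say which test the paper itself uses.

```python
# Chi-square test of independence on an identity-by-label contingency table.
# The counts below are made up for illustration only.

def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a table of counts.

    Rows are identity groups, columns are labels (e.g. hedged, not hedged).
    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical blind re-labeling: three groups, 100 responses each.
table = [
    [30, 70],  # group A: hedged, not hedged
    [50, 50],  # group B
    [40, 60],  # group C
]
statistic, dof = chi_square_independence(table)
print(round(statistic, 3), dof)  # 8.333 2
```

With 2 degrees of freedom the 5% critical value is about 5.99, so these invented counts would still indicate dependence on group; a result below that threshold on real blind-rated data would count against the paper's claim.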

Figures

Figures reproduced from arXiv: 2502.19463 by Abdullah Zaini, Anushe Sheikh, Cassandra Parent, David Yanni, Iason Gabriel, Jackie Kay, Laura Weidinger, Maribeth Rauh, Marzyeh Ghassemi, Rafiya Javed, Ramona Comanescu, Walter Gerych.

Figure 1. Baseline rates of hedging and non-affirmation on …
Figure 2. Baseline rates of hedging and non-affirmation re…
Figure 3. Mean rates of hedging and non-affirmation per …
Figure 7. This reveals a constant gap in metrics between high- …
Figure 5. Per-group Statistical Parity Difference was calcu…
Figure 6. Per-group Statistical Parity Difference is shown …
Figure 7. Disparity between high- and low-endorsement iden…
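Figures 5 and 6 refer to a per-group Statistical Parity Difference, but the truncated captions do not give the formula. A minimal sketch under one common reading follows: the rate of the flagged behavior for a group minus the rate across all other groups. The reference population, the function name, and the toy labels are assumptions, and the paper may define the baseline differently.

```python
# Per-group Statistical Parity Difference (SPD), one common formulation:
# SPD(g) = P(label = 1 | group = g) - P(label = 1 | group != g).
# The choice of "all other groups" as the reference is an assumption.

def statistical_parity_difference(labels_by_group):
    """Map each group to its rate minus the pooled rate of the other groups.

    labels_by_group: dict of group name -> list of 0/1 labels,
    where 1 means the response was flagged (e.g. as hedging).
    Requires at least two non-empty groups.
    """
    total_pos = sum(sum(labels) for labels in labels_by_group.values())
    total_n = sum(len(labels) for labels in labels_by_group.values())
    spd = {}
    for group, labels in labels_by_group.items():
        group_rate = sum(labels) / len(labels)
        rest_rate = (total_pos - sum(labels)) / (total_n - len(labels))
        spd[group] = group_rate - rest_rate
    return spd

# Toy labels (invented): 1 = hedged, 0 = clear affirmation.
labels_by_group = {
    "A": [1, 1, 0, 0],
    "B": [1, 0, 0, 0],
    "C": [0, 0, 0, 0],
}
spd = statistical_parity_difference(labels_by_group)
print(spd)  # {'A': 0.375, 'B': 0.0, 'C': -0.375}
```

A value of zero means the group is treated like the rest; positive values mean the behavior is more frequent for that group.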
Original abstract

Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights - which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models as well as one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that 4 out of 7 display hedging and non-affirmation that is significantly dependent on the identity of the group. While factors like conflict signals, sovereignty (whether identity is stateless), or economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than the impact of identity itself. The systematic disparity is robust to methods of rephrasing the prompts. Since group identity is the strongest predictor of these behaviors, we use open-weight models to explore whether applying steering and orthogonalization techniques to these group identities can mitigate the rates of hedging and non-affirmation behaviors. We find that group steering is the most effective debiasing approach across query types and is robust to downstream forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a systematic framework for quantifying hedging and non-affirmation in LLM responses to human rights questions across 205 national and stateless ethnic identities. Using 4738 prompts on six proprietary and one open-weight model, it reports that four of seven models exhibit statistically significant identity-dependent disparities in these behaviors. Group identity is presented as the strongest predictor, with larger effect sizes than conflict signals, sovereignty status, or GDP; prompt rephrasing does not eliminate the pattern. The authors further show that group steering on open-weight models is the most effective mitigation, remaining robust to downstream forgetting.

Significance. If the measurement pipeline is valid, the work provides a large-scale empirical demonstration (4738 prompts, 205 identities) that LLM alignment on universal human rights is identity-sensitive, with effect-size comparisons across multiple covariates and an explicit test of a mitigation technique (group steering). The scale of the evaluation and the steering experiments constitute concrete strengths that would advance understanding of fairness failures in deployed models.

major comments (2)
  1. [Methods / Response Classification] The central claim that group identity is the strongest predictor (with effect sizes exceeding those of conflict, sovereignty, and GDP) rests on the binary/continuous outcome variable for hedging and non-affirmation. The abstract states robustness to prompt rephrasing, yet this only perturbs the input; no validation is described for whether the downstream labeling procedure (automated classifier, rule-based, or human coders) itself varies systematically with identity group or cultural framing. Without such a check, the reported disparities and the ranking of effect sizes are vulnerable to measurement artifact.
  2. [Results / Effect Size Analysis] The comparison of effect sizes across predictors requires an explicit multivariate specification (e.g., logistic or linear regression with all covariates entered simultaneously, plus standardized coefficients or partial R² values). If the identity variable is entered after or without controls for multicollinearity with sovereignty or conflict signals, the claim that identity dominates cannot be evaluated from the reported results.
minor comments (2)
  1. [Experimental Setup] Clarify whether the 4738 prompts are balanced across the 205 identities or whether some groups receive disproportionately more queries; report the exact distribution.
  2. [Methods] Provide the precise operational definitions and any inter-annotator agreement statistics for the hedging/non-affirmation labels, even if only in supplementary material.
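The inter-annotator agreement requested in minor comment 2 is often reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic implementation on invented labels; the page does not state which agreement statistic, if any, the paper reports.

```python
# Cohen's kappa for two annotators labeling the same responses.
# Labels below are invented for illustration.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)

# H = hedging, A = clear affirmation; the annotators agree on 8 of 10 items.
annotator_1 = ["H", "H", "H", "H", "H", "A", "A", "A", "A", "A"]
annotator_2 = ["H", "H", "H", "H", "A", "A", "A", "A", "A", "H"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.6
```

Reporting kappa separately for each identity group would also bear on major comment 1, since it would show whether labeling reliability itself varies by group.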

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our measurement and analysis pipeline. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods / Response Classification] The central claim that group identity is the strongest predictor (with effect sizes exceeding those of conflict, sovereignty, and GDP) rests on the binary/continuous outcome variable for hedging and non-affirmation. The abstract states robustness to prompt rephrasing, yet this only perturbs the input; no validation is described for whether the downstream labeling procedure (automated classifier, rule-based, or human coders) itself varies systematically with identity group or cultural framing. Without such a check, the reported disparities and the ranking of effect sizes are vulnerable to measurement artifact.

    Authors: We agree that an explicit check for identity-dependent bias in the labeling procedure is necessary to rule out measurement artifacts. Our current robustness section focuses on prompt rephrasing, but does not include a dedicated validation of the classifier across groups. In revision, we will add a new subsection describing a human-annotated validation set (stratified by identity) and report inter-annotator agreement plus any systematic differences in automated labels by group. If disparities are found, we will quantify their impact on the main results. revision: yes

  2. Referee: [Results / Effect Size Analysis] The comparison of effect sizes across predictors requires an explicit multivariate specification (e.g., logistic or linear regression with all covariates entered simultaneously, plus standardized coefficients or partial R² values). If the identity variable is entered after or without controls for multicollinearity with sovereignty or conflict signals, the claim that identity dominates cannot be evaluated from the reported results.

    Authors: We concur that separate univariate comparisons, while informative, do not fully address potential multicollinearity or joint explanatory power. The revised manuscript will include multivariate logistic regressions with all covariates (identity, conflict signals, sovereignty, GDP) entered simultaneously. We will report standardized coefficients, variance inflation factors, and partial R² values to allow direct comparison of effect sizes under controls. revision: yes
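To make the multicollinearity concern concrete, the sketch below computes a variance inflation factor for the simplest case of two covariates, where VIF = 1 / (1 - r²) and r is their Pearson correlation. The covariate values are invented, and the two-predictor shortcut is an assumption of this example; with more covariates each one would be regressed on all the others to obtain its R².

```python
# Variance inflation factor (VIF) for two covariates.
# With exactly two predictors, R^2 of one regressed on the other equals r^2.
# The data below are invented for illustration.
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def vif_two_predictors(x, y):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)

# Hypothetical indicators per identity: stateless (1/0), conflict signal (1/0).
stateless = [1, 1, 0, 0, 1, 0, 0, 0]
conflict = [1, 1, 1, 0, 1, 0, 0, 0]

vif = vif_two_predictors(stateless, conflict)
print(round(vif, 3))  # 2.5
```

A VIF near 1 indicates little overlap between covariates; larger values mean their coefficients, and therefore any ranking of effect sizes, become harder to separate.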

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study


This is an empirical measurement study that collects LLM responses to fixed prompts across identity groups, applies classification rules or models to label hedging/non-affirmation, and reports statistical associations (effect sizes, predictors). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citation chains appear in the derivation of the central claims. The identity-effect comparison rests on observed data rather than any reduction to the measurement pipeline by construction. Self-citations, if present, are not invoked to justify uniqueness or forbid alternatives. The study is therefore self-contained against external benchmarks and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the domain assumption that hedging is undesirable for human rights statements and that the chosen prompts isolate identity effects; no free parameters or invented entities are described.

axioms (1)
  • Domain assumption: human rights apply unambiguously to all groups.
    Explicitly stated as the context making hedging undesirable.


