Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights
Pith reviewed 2026-05-23 02:22 UTC · model grok-4.3
The pith
Group identity is the strongest predictor of hedging and non-affirmation in LLM answers to human rights questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Four of the seven LLMs exhibit hedging and non-affirmation that is statistically dependent on the queried identity, with identity producing larger effect sizes than conflict, sovereignty, or economic indicators. Group steering applied to open-weight models reduces these rates across query types and does not produce downstream forgetting, making it the strongest mitigation method identified.
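One way to make "statistically dependent on the queried identity" concrete is a chi-square test of independence on an identity-by-label contingency table, with Cramér's V as the effect size. The sketch below is illustrative only: the summary does not state which test or effect-size measure the paper uses, and the identity names and counts are hypothetical.

```python
from math import sqrt


def chi_square_and_cramers_v(table):
    """Return (chi2, Cramér's V) for a contingency table.

    table: list of rows, one row per identity, e.g. [hedged_count, affirmed_count].
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(table), len(table[0])) - 1
    if n == 0 or k < 1:
        raise ValueError("need at least a 2x2 table with non-zero counts")

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells whose row or column is empty
                chi2 += (observed - expected) ** 2 / expected

    return chi2, sqrt(chi2 / (n * k))


# Hypothetical counts: [hedged, affirmed] responses per identity.
counts = {
    "identity_A": [2, 21],
    "identity_B": [9, 14],
    "identity_C": [3, 20],
}
chi2, v = chi_square_and_cramers_v(list(counts.values()))
print(f"chi2={chi2:.2f}, Cramér's V={v:.2f}")
```

Cramér's V lies between 0 (no association) and 1 (labels fully determined by identity), which makes it comparable across predictors with different numbers of levels.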
What carries the argument
The measurement framework that classifies unconstrained LLM responses as hedging or non-affirmation across 205 identities, paired with group-identity steering vectors for debiasing.
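The summary does not say how the group-identity steering vectors are built. A common construction, assumed here purely for illustration, takes the difference between mean hidden activations on group-specific prompts and on baseline prompts, then either adds a scaled copy of that direction (steering) or projects it out (orthogonalization). All function names and the toy activations below are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steering_vector(group_acts, baseline_acts):
    """Difference-of-means direction: mean(group) - mean(baseline)."""
    mg, mb = mean_vector(group_acts), mean_vector(baseline_acts)
    return [g - b for g, b in zip(mg, mb)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction by a factor alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def orthogonalize(hidden, direction):
    """Remove the component of the hidden state that lies along the direction."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        return list(hidden)
    scale = dot(hidden, direction) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (hypothetical values).
group_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
direction = steering_vector(group_acts, baseline_acts)  # [2.0, 1.0, 0.0]

hidden = [1.0, 1.0, 1.0]
print(steer(hidden, direction, alpha=-1.0))
print(orthogonalize(hidden, direction))
```

In a real model these operations would be applied to a chosen layer's residual stream during generation; the layer and the scale alpha are free choices that this page does not specify.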
If this is right
- Identity produces larger disparities in affirmation than conflict signals or economic status.
- Four of the seven tested models display measurable identity-linked differences in human-rights responses.
- Group steering reduces hedging rates more effectively than other tested debiasing methods.
- The observed patterns remain stable when prompts are rephrased.
- Steering vectors do not cause measurable forgetting on unrelated tasks.
Where Pith is reading between the lines
- Similar identity-linked patterns may appear in other domains that require uniform ethical judgments.
- Alignment methods could be extended to neutralize group-specific activations rather than only prompt-level fixes.
- Downstream applications such as legal or policy drafting may inherit these disparities unless steering is applied.
Load-bearing premise
The classification rules used to label responses as hedging or non-affirmation accurately capture model intent and are not artifacts of prompt wording or evaluator bias.
What would settle it
The claim would be undermined if a re-evaluation of the same prompts, using an independent classification method or human raters blind to the identity labels, found no significant dependence on group identity.
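Once responses have been relabeled by blinded raters, a permutation test is one assumption-light way to ask whether the labels still depend on identity. The sketch below is not the paper's procedure: the test statistic (the spread between the highest and lowest per-group hedging rate), the function names, and the toy data are all assumptions made for illustration.

```python
import random


def rate_spread(labels, groups):
    """Max minus min hedging rate across groups; labels are 0 (affirmed) or 1 (hedged)."""
    totals, hits = {}, {}
    for y, g in zip(labels, groups):
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + y
    rates = [hits[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def permutation_p_value(labels, groups, n_perm=2000, seed=0):
    """Share of random group reassignments whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = rate_spread(labels, groups)
    shuffled = list(groups)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between identity and label
        if rate_spread(labels, shuffled) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


# Toy blinded labels for two hypothetical identities.
labels = [1] * 10 + [0] * 10
groups = ["identity_A"] * 10 + ["identity_B"] * 10
print(permutation_p_value(labels, groups))
```

A large p-value on the blinded labels, alongside a small one on the original automated labels, would point toward the labeling procedure rather than the models as the source of the disparity.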
Original abstract
Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights - which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models as well as one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that 4 out of 7 display hedging and non-affirmation that is significantly dependent on the identity of the group. While factors like conflict signals, sovereignty (whether identity is stateless), or economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than the impact of identity itself. The systematic disparity is robust to methods of rephrasing the prompts. Since group identity is the strongest predictor of these behaviors, we use open-weight models to explore whether applying steering and orthogonalization techniques to these group identities can mitigate the rates of hedging and non-affirmation behaviors. We find that group steering is the most effective debiasing approach across query types and is robust to downstream forgetting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a systematic framework for quantifying hedging and non-affirmation in LLM responses to human rights questions across 205 national and stateless ethnic identities. Using 4738 prompts on six proprietary models and one open-weight model, it reports that four of the seven models exhibit statistically significant identity-dependent disparities in these behaviors. Group identity is presented as the strongest predictor, with larger effect sizes than conflict signals, sovereignty status, or GDP; prompt rephrasing does not eliminate the pattern. The authors further report that group steering on open-weight models is the most effective mitigation and does not cause measurable downstream forgetting.
Significance. If the measurement pipeline is valid, the work provides a large-scale empirical demonstration (4738 prompts, 205 identities) that LLM alignment on universal human rights is identity-sensitive, with effect-size comparisons across multiple covariates and an explicit test of a mitigation technique (group steering). The scale of the evaluation and the steering experiments constitute concrete strengths that would advance understanding of fairness failures in deployed models.
major comments (2)
- [Methods / Response Classification] The central claim that group identity is the strongest predictor (with effect sizes exceeding those of conflict, sovereignty, and GDP) rests on the binary/continuous outcome variable for hedging and non-affirmation. The abstract states robustness to prompt rephrasing, yet this only perturbs the input; no validation is described for whether the downstream labeling procedure (automated classifier, rule-based, or human coders) itself varies systematically with identity group or cultural framing. Without such a check, the reported disparities and the ranking of effect sizes are vulnerable to measurement artifact.
- [Results / Effect Size Analysis] The comparison of effect sizes across predictors requires an explicit multivariate specification (e.g., logistic or linear regression with all covariates entered simultaneously, plus standardized coefficients or partial R² values). If the identity variable is entered after or without controls for multicollinearity with sovereignty or conflict signals, the claim that identity dominates cannot be evaluated from the reported results.
minor comments (2)
- [Experimental Setup] Clarify whether the 4738 prompts are balanced across the 205 identities or whether some groups receive disproportionately more queries; report the exact distribution.
- [Methods] Provide the precise operational definitions and any inter-annotator agreement statistics for the hedging/non-affirmation labels, even if only in supplementary material.
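For the inter-annotator agreement requested in the second minor comment, Cohen's kappa is a standard chance-corrected statistic for two raters. The sketch below assumes two annotators and a hypothetical three-way label set; neither the label names nor the number of annotators comes from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four responses.
rater_1 = ["hedging", "hedging", "affirmation", "affirmation"]
rater_2 = ["hedging", "affirmation", "affirmation", "affirmation"]
print(cohens_kappa(rater_1, rater_2))  # 0.5
```

Computing kappa separately within each identity stratum, rather than only on the pooled set, would speak directly to the first major comment about identity-dependent labeling bias.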
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our measurement and analysis pipeline. We address each major comment below and will incorporate revisions to strengthen the manuscript.
- Referee: [Methods / Response Classification] The central claim that group identity is the strongest predictor (with effect sizes exceeding those of conflict, sovereignty, and GDP) rests on the binary/continuous outcome variable for hedging and non-affirmation. The abstract states robustness to prompt rephrasing, yet this only perturbs the input; no validation is described for whether the downstream labeling procedure (automated classifier, rule-based, or human coders) itself varies systematically with identity group or cultural framing. Without such a check, the reported disparities and the ranking of effect sizes are vulnerable to measurement artifact.
Authors: We agree that an explicit check for identity-dependent bias in the labeling procedure is necessary to rule out measurement artifacts. Our current robustness section focuses on prompt rephrasing, but does not include a dedicated validation of the classifier across groups. In revision, we will add a new subsection describing a human-annotated validation set (stratified by identity) and report inter-annotator agreement plus any systematic differences in automated labels by group. If disparities are found, we will quantify their impact on the main results. revision: yes
- Referee: [Results / Effect Size Analysis] The comparison of effect sizes across predictors requires an explicit multivariate specification (e.g., logistic or linear regression with all covariates entered simultaneously, plus standardized coefficients or partial R² values). If the identity variable is entered after or without controls for multicollinearity with sovereignty or conflict signals, the claim that identity dominates cannot be evaluated from the reported results.
Authors: We concur that separate univariate comparisons, while informative, do not fully address potential multicollinearity or joint explanatory power. The revised manuscript will include multivariate logistic regressions with all covariates (identity, conflict signals, sovereignty, GDP) entered simultaneously. We will report standardized coefficients, variance inflation factors, and partial R² values to allow direct comparison of effect sizes under controls. revision: yes
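The variance inflation factors promised in the second response measure how much a coefficient's variance grows because a predictor is correlated with the others. For exactly two predictors the VIF reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below covers only that special case, with hypothetical per-identity indicators. A full analysis would regress each covariate on all the others, and a categorical identity variable with many levels would need a generalized form.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF of either predictor in a model with exactly two predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical indicators, one value per identity.
stateless = [0, 0, 0, 1, 1, 1]  # sovereignty flag
conflict = [0, 1, 0, 1, 1, 0]   # conflict-signal flag
print(vif_two_predictors(stateless, conflict))  # r = 1/3, so VIF = 1.125
```

A VIF of 1 means the predictors are uncorrelated; rules of thumb often treat values above 5 or 10 as a sign that individual effect sizes cannot be separated reliably.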
Circularity Check
No significant circularity in empirical measurement study
This is an empirical measurement study that collects LLM responses to fixed prompts across identity groups, applies classification rules or models to label hedging/non-affirmation, and reports statistical associations (effect sizes, predictors). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citation chains appear in the derivation of the central claims. The identity-effect comparison rests on observed data rather than any reduction to the measurement pipeline by construction. Self-citations, if present, are not invoked to justify uniqueness or forbid alternatives. The study is therefore self-contained against external benchmarks and receives the default non-circular finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human rights apply unambiguously to all groups