Hedging and Non-Affirmation: Quantifying LLM Alignment on Questions of Human Rights

Rafiya Javed, Cassandra Parent, Jackie Kay, David Yanni, Abdullah Zaini, Anushe Sheikh, Maribeth Rauh, Walter Gerych, Ramona Comanescu, Iason Gabriel, Marzyeh Ghassemi, Laura Weidinger

arXiv: 2502.19463 · v2 · submitted 2025-02-26 · cs.CY · cs.AI · cs.SI

Pith reviewed 2026-05-23 02:22 UTC · model grok-4.3

keywords: LLM alignment, human rights, hedging behavior, non-affirmation, group identity, debiasing, steering vectors

The pith

Group identity is the strongest predictor of hedging and non-affirmation in LLM answers to human rights questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to quantify how often LLMs hedge or refuse to affirm human rights statements when the query names a specific national or ethnic identity. Across 4738 prompts and seven models, four models showed response patterns that depended significantly on the named group, and the size of that dependence exceeded the effects of conflict signals, statelessness, or national GDP. The authors then test mitigation on open-weight models and find that steering vectors tied to group identities reduce the unwanted behaviors more effectively than other approaches while remaining robust to prompt rephrasing. A sympathetic reader would care because human rights are presented as universal, yet the models introduce group-dependent differences in how clearly they endorse them.

Core claim

Four of the seven LLMs exhibit hedging and non-affirmation that is statistically dependent on the queried identity, with identity producing larger effect sizes than conflict, sovereignty, or economic indicators. Group steering applied to open-weight models reduces these rates across query types and does not produce downstream forgetting, making it the strongest mitigation method identified.

What carries the argument

The measurement framework that classifies unconstrained LLM responses as hedging or non-affirmation across 205 identities, paired with group-identity steering vectors for debiasing.
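This page does not spell out how the steering vectors are built, so the following is only a generic sketch of the two operations the abstract names, steering and orthogonalization, under stated assumptions. It assumes a difference-of-means direction computed from two hypothetical sets of hidden activations; the function names, the contrast sets, and the scale `alpha` are illustrative and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Candidate steering direction: mean(pos) - mean(neg).

    `pos` and `neg` are hypothetical contrast sets of activations.
    """
    return [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def steer(h: Vector, direction: Vector, alpha: float) -> Vector:
    """Steering: shift an activation along the direction by alpha."""
    return [a + alpha * d for a, d in zip(h, direction)]


def orthogonalize(h: Vector, direction: Vector) -> Vector:
    """Orthogonalization: remove the component of h along the direction."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0.0:
        return list(h)  # a zero direction leaves the activation unchanged
    scale = dot(h, direction) / norm_sq
    return [a - scale * d for a, d in zip(h, direction)]
```

After `orthogonalize`, the activation has zero projection on the direction, whereas `steer` keeps the original activation and adds a scaled offset. Which layer is modified and how `alpha` is chosen would depend on the model and are left open here.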

If this is right

Identity produces larger disparities in affirmation than conflict signals or economic status.
Four of the seven tested models display measurable identity-linked differences in human-rights responses.
Group steering reduces hedging rates more effectively than other tested debiasing methods.
The observed patterns remain stable when prompts are rephrased.
Steering vectors do not cause measurable forgetting on unrelated tasks.
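The figure captions name a per-group Statistical Parity Difference as the disparity measure. A minimal sketch of one common form of it follows, assuming SPD for a group is that group's hedging rate minus the pooled rate of all other groups. The reference group the paper actually uses is not stated on this page, and the record layout and function name are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def statistical_parity_difference(
    records: Iterable[Tuple[str, int]]
) -> Dict[str, float]:
    """Per-group SPD from (group, label) pairs.

    label is 1 if the response was labeled as hedging (or non-affirmation)
    and 0 otherwise. SPD(g) = rate(g) - rate(all other groups pooled).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, label in records:
        counts[group][0] += label
        counts[group][1] += 1

    total_pos = sum(pos for pos, _ in counts.values())
    total_n = sum(n for _, n in counts.values())

    spd = {}
    for group, (pos, n) in counts.items():
        rest_n = total_n - n
        if rest_n == 0:
            continue  # only one group present, so no comparison is possible
        spd[group] = pos / n - (total_pos - pos) / rest_n
    return spd
```

A value near zero means the group is treated like the rest of the population; a positive value means responses about that group are hedged more often.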

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar identity-linked patterns may appear in other domains that require uniform ethical judgments.
Alignment methods could be extended to neutralize group-specific activations rather than only prompt-level fixes.
Downstream applications such as legal or policy drafting may inherit these disparities unless steering is applied.

Load-bearing premise

The classification rules used to label responses as hedging or non-affirmation accurately capture model intent and are not artifacts of prompt wording or evaluator bias.

What would settle it

A re-evaluation of the same prompts with an independent classification method or with human raters who are blind to the identity labels finds no significant dependence on group identity.
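One way such a re-evaluation could test for dependence on group identity is a permutation test on the relabeled responses. The sketch below is an assumption about how that check might be run, not the paper's procedure: it uses the spread between the highest and lowest per-group rates as the test statistic and shuffles labels across responses to simulate independence. All names and defaults are illustrative.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def rate_spread(groups: Sequence[str], labels: Sequence[int]) -> float:
    """Max minus min of per-group positive-label rates."""
    tally = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, label in zip(groups, labels):
        tally[group][0] += label
        tally[group][1] += 1
    rates = [pos / n for pos, n in tally.values()]
    return max(rates) - min(rates)


def permutation_p_value(
    groups: Sequence[str],
    labels: Sequence[int],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """Share of label shuffles whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = rate_spread(groups, labels)
    shuffled: List[int] = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between group and label
        if rate_spread(groups, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

A large p-value on the independently relabeled data would be consistent with no identity dependence. With 205 groups, a statistic less sensitive to small groups than the max-min spread may be preferable.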

Figures

Figures reproduced from arXiv: 2502.19463 by Abdullah Zaini, Anushe Sheikh, Cassandra Parent, David Yanni, Iason Gabriel, Jackie Kay, Laura Weidinger, Maribeth Rauh, Marzyeh Ghassemi, Rafiya Javed, Ramona Comanescu, Walter Gerych.

**Figure 2.** Baseline rates of hedging and non-affirmation re…

**Figure 3.** Mean rates of hedging and non-affirmation per…

**Figure 7.** This reveals a constant gap in metrics between high-…

**Figure 5.** Per-group Statistical Parity Difference was calcu…

**Figure 6.** Per-group Statistical Parity Difference is shown…

**Figure 7.** Disparity between high- and low-endorsement iden…

Original abstract

Hedging and non-affirmation are behaviors exhibited by large language models (LLMs) that limit the clear endorsement of specific statements. While these behaviors are desirable in subjective contexts, they are undesirable in the context of human rights - which apply unambiguously to all groups. We present a systematic framework to measure these behaviors in unconstrained LLM responses regarding various identity groups. We evaluate six large proprietary models as well as one open-weight LLM on 4738 prompts across 205 national and stateless ethnic identities and find that 4 out of 7 display hedging and non-affirmation that is significantly dependent on the identity of the group. While factors like conflict signals, sovereignty (whether identity is stateless), or economic indicators (GDP) also influence model behavior, their effect sizes are consistently weaker than the impact of identity itself. The systematic disparity is robust to methods of rephrasing the prompts. Since group identity is the strongest predictor of these behaviors, we use open-weight models to explore whether applying steering and orthogonalization techniques to these group identities can mitigate the rates of hedging and non-affirmation behaviors. We find that group steering is the most effective debiasing approach across query types and is robust to downstream forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper measures identity-linked hedging on human rights across many groups and models, finds identity beats other predictors, and shows group steering reduces it.

Group identity is the strongest driver of hedging and non-affirmation in LLM answers to human rights questions, stronger than conflict signals or GDP, and steering on the identity vector cuts the behavior most effectively. That is the core result from the 4738-prompt evaluation on seven models and 205 identities. Four of the seven models show clear identity dependence. The framework itself is new in its scale and in directly comparing effect sizes across identity, sovereignty, and economic factors while testing mitigation on open models. The rephrasing robustness check is a solid step, and the finding that group steering outperforms other debiasing approaches is useful for anyone working on consistent model behavior.

The main soft spot is the outcome labeling. The abstract only shows that changing prompt wording does not erase the identity pattern; it does not test whether the downstream classification of responses as hedging or non-affirmation itself varies with group or cultural framing. If the classifier or coders apply different thresholds across identities, the reported effect sizes become harder to trust. The paper would be tighter with explicit validation of the labeling step against that risk.

This work is for alignment and fairness researchers who need concrete measurements of disparate treatment on unambiguous claims. It is worth sending to peer review because the scale is real and the question matters, even if the classification details will require extra scrutiny from referees.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a systematic framework for quantifying hedging and non-affirmation in LLM responses to human rights questions across 205 national and stateless ethnic identities. Using 4738 prompts on six proprietary and one open-weight model, it reports that four of seven models exhibit statistically significant identity-dependent disparities in these behaviors. Group identity is presented as the strongest predictor, with larger effect sizes than conflict signals, sovereignty status, or GDP; prompt rephrasing does not eliminate the pattern. The authors further show that group steering on open-weight models is the most effective mitigation, remaining robust to downstream forgetting.

Significance. If the measurement pipeline is valid, the work provides a large-scale empirical demonstration (4738 prompts, 205 identities) that LLM alignment on universal human rights is identity-sensitive, with effect-size comparisons across multiple covariates and an explicit test of a mitigation technique (group steering). The scale of the evaluation and the steering experiments constitute concrete strengths that would advance understanding of fairness failures in deployed models.

major comments (2)

[Methods / Response Classification] The central claim that group identity is the strongest predictor (with effect sizes exceeding those of conflict, sovereignty, and GDP) rests on the binary/continuous outcome variable for hedging and non-affirmation. The abstract states robustness to prompt rephrasing, yet this only perturbs the input; no validation is described for whether the downstream labeling procedure (automated classifier, rule-based, or human coders) itself varies systematically with identity group or cultural framing. Without such a check, the reported disparities and the ranking of effect sizes are vulnerable to measurement artifact.
[Results / Effect Size Analysis] The comparison of effect sizes across predictors requires an explicit multivariate specification (e.g., logistic or linear regression with all covariates entered simultaneously, plus standardized coefficients or partial R² values). If the identity variable is entered after or without controls for multicollinearity with sovereignty or conflict signals, the claim that identity dominates cannot be evaluated from the reported results.

minor comments (2)

[Experimental Setup] Clarify whether the 4738 prompts are balanced across the 205 identities or whether some groups receive disproportionately more queries; report the exact distribution.
[Methods] Provide the precise operational definitions and any inter-annotator agreement statistics for the hedging/non-affirmation labels, even if only in supplementary material.
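The inter-annotator agreement requested above is often reported as Cohen's kappa for two raters. A minimal sketch follows, assuming two annotators each assign one categorical label per response. The label names are illustrative, and the page does not say which agreement statistic the authors would report.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement if each rater labeled independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Computing kappa separately for each identity group, rather than only once overall, would speak directly to the first major comment, since it shows whether labeling reliability itself differs by group.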

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our measurement and analysis pipeline. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Methods / Response Classification] The central claim that group identity is the strongest predictor (with effect sizes exceeding those of conflict, sovereignty, and GDP) rests on the binary/continuous outcome variable for hedging and non-affirmation. The abstract states robustness to prompt rephrasing, yet this only perturbs the input; no validation is described for whether the downstream labeling procedure (automated classifier, rule-based, or human coders) itself varies systematically with identity group or cultural framing. Without such a check, the reported disparities and the ranking of effect sizes are vulnerable to measurement artifact.

Authors: We agree that an explicit check for identity-dependent bias in the labeling procedure is necessary to rule out measurement artifacts. Our current robustness section focuses on prompt rephrasing, but does not include a dedicated validation of the classifier across groups. In revision, we will add a new subsection describing a human-annotated validation set (stratified by identity) and report inter-annotator agreement plus any systematic differences in automated labels by group. If disparities are found, we will quantify their impact on the main results. revision: yes

Referee: [Results / Effect Size Analysis] The comparison of effect sizes across predictors requires an explicit multivariate specification (e.g., logistic or linear regression with all covariates entered simultaneously, plus standardized coefficients or partial R² values). If the identity variable is entered after or without controls for multicollinearity with sovereignty or conflict signals, the claim that identity dominates cannot be evaluated from the reported results.

Authors: We concur that separate univariate comparisons, while informative, do not fully address potential multicollinearity or joint explanatory power. The revised manuscript will include multivariate logistic regressions with all covariates (identity, conflict signals, sovereignty, GDP) entered simultaneously. We will report standardized coefficients, variance inflation factors, and partial R² values to allow direct comparison of effect sizes under controls. revision: yes
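The variance inflation factors promised in this response can be illustrated for the simplest case. With exactly two predictors, the VIF of either one reduces to 1 / (1 - r²), where r is their Pearson correlation. The sketch below covers only that special case; with all four covariates, each VIF would instead come from regressing one predictor on the others. The function names and the pairing of variables are illustrative.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant predictor")
    return sxy / math.sqrt(sxx * syy)


def pairwise_vif(x: Sequence[float], y: Sequence[float]) -> float:
    """VIF in a two-predictor model: 1 / (1 - r^2)."""
    r_squared = pearson_r(x, y) ** 2
    if r_squared >= 1.0:
        return float("inf")  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)
```

A VIF of 1 means the two predictors are uncorrelated. A common rule of thumb treats values above roughly 5 to 10 as a sign that the coefficients cannot be separated reliably.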

Circularity Check

0 steps flagged

No significant circularity in empirical measurement study

This is an empirical measurement study that collects LLM responses to fixed prompts across identity groups, applies classification rules or models to label hedging/non-affirmation, and reports statistical associations (effect sizes, predictors). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citation chains appear in the derivation of the central claims. The identity-effect comparison rests on observed data rather than any reduction to the measurement pipeline by construction. Self-citations, if present, are not invoked to justify uniqueness or forbid alternatives. The study is therefore self-contained against external benchmarks and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that hedging is undesirable for human rights statements and that the chosen prompts isolate identity effects; no free parameters or invented entities are described.

axioms (1)

Domain assumption: Human rights apply unambiguously to all groups. Explicitly stated as the context making hedging undesirable.


[1] [1]

Ahmed Agiza, Mohamed Mostagir, and Sherief Reda. 2024. Analyzing the Impact of Data Selection and Fine-Tuning on Economic and Political Biases in LLMs. arXiv preprint arXiv:2404.08699 (2024)

work page arXiv 2024

[2] [2]

Google AI. 2025. Our Principles. https://ai.google/responsibility/principles/

work page 2025

[3] [3]

Anthropic. 2023. Claude’s Constitution. https://www.anthropic.com/news/ claudes-constitution

work page 2023

[4] [4]

Anthropic. 2023. Collective Constitutional AI: Aligning a Language Model with Public Input . https://www.anthropic.com/news/collective-constitutional-ai- aligning-a-language-model-with-public-input

work page 2023

[5] [5]

UN General Assembly. 2024. Seizing the opportunities of safe, secure and trustworthy artificial intelligence systems for sustainable development: reso- lution/adopted by the General Assembly. (2024)

work page 2024

[6] [6]

Maarten Buyl, Alexander Rogiers, Sander Noels, Iris Dominguez-Catena, Edith Heiter, Raphael Romero, Iman Johary, Alexandru-Cristian Mara, Jefrey Lijffijt, and Tijl De Bie. 2024. Large language models reflect the ideology of their creators. arXiv preprint arXiv:2410.18417 (2024)

work page arXiv 2024

[7] [7]

Corinne Cath, Mark Latonero, Vidushi Marda, and Roya Pakzad. 2020. Leap of FATE: human rights as a complementary framework for AI policy and practice. In Proceedings of the 2020 conference on fairness, accountability, and transparency . 702–702

work page 2020

[8] [8]

Sunipa Dev, Tao Li, Jeff M Phillips, and Vivek Srikumar. 2020. On measuring and mitigating biased inferences of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7659–7666

work page 2020

[9] [9]

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruk- sachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency . 862–872

work page 2021

[10] [10]

International Monetary Fund. 2024. World Economic Outlook Database. https: //www.imf.org/en/Publications/WEO/weo-database/2024/April/weo-report

work page 2024

[11] [11]

Iason Gabriel. 2020. Artificial intelligence, values, and alignment. Minds and machines 30, 3 (2020), 411–437

work page 2020

[12] [12]

Louie Giray. 2023. Prompt engineering with ChatGPT: a guide for academic writers. Annals of biomedical engineering 51, 12 (2023), 2629–2633

work page 2023

[13] [13]

Przemyslaw A Grabowicz, Nicholas Perello, and Aarshee Mishra. 2022. Marrying fairness and explainability in supervised learning. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency . 1905–1916

work page 2022

[14] [14]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language under- standing. arXiv preprint arXiv:2009.03300 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[15] [15]

Corinna Hertweck, Christoph Heitz, and Michele Loi. 2021. On the moral justifi- cation of statistical parity. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 747–757

work page 2021

[16] [16]

J Joseph Hewitt, Jonathan Wilkenfeld, and Ted Robert Gurr. 2017. Peace and conflict 2010. Routledge

work page 2017

[17] [17]

Cindy Holder and David Reidy. 2013. Human rights: The hard questions . Cam- bridge University Press

work page 2013

[18] [18]

Steven LB Jensen. 2016. The making of international human rights: the 1960s, decolonization, and the reconstruction of global values . Cambridge University Press

work page 2016

[19] [19]

Hang Jiang, Doug Beeferman, Brandon Roy, and Deb Roy. 2022. Commu- nityLM: Probing partisan worldviews from language models. arXiv preprint arXiv:2209.07065 (2022)

work page arXiv 2022

[20] [20]

2016.Ma- chine Bias

Surya Mattu Julia Angwin, Jeff Larson and ProPublica Lauren Kirchner. 2016.Ma- chine Bias. https://www.propublica.org/article/machine-bias-risk-assessments- in-criminal-sentencing

work page 2016

[21] [21]

Daniel Kahneman and Amos Tversky. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the fundamentals of financial decision making: Part I. World Scientific, 99–127

work page 2013

[22] [22]

Gauri Kambhatla, Ian Stewart, and Rada Mihalcea. 2022. Surfacing racial stereo- types through identity portrayal. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency. 1604–1615

work page 2022

[23] [23]

Atoosa Kasirzadeh and Iason Gabriel. 2023. In conversation with artificial intelli- gence: aligning language models with human values. Philosophy & Technology 36, 2 (2023), 27

work page 2023

[24] [24]

Daniel Kazenwadel and Christoph V Steinert. 2023. How User Language Affects Conflict Fatality Estimates in ChatGPT. arXiv preprint arXiv:2308.00072 (2023)

work page arXiv 2023

[25] [25]

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Miku- lik, and Geoffrey Irving. 2021. Alignment of language agents. arXiv preprint arXiv:2103.14659 (2021)

work page arXiv 2021

[26] Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, et al. 2024. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models. arXiv preprint arXiv:2404.16019 (2024).

[28] George Lakoff. 1973. Hedges: A study in meaning criteria and the logic of fuzzy concepts. Journal of philosophical logic 2, 4 (1973), 458–508.

[29] Jay L Lemke. 1992. Interpersonal meaning in discourse: Value orientations. Advances in systemic linguistics: Recent theory and practice 82 (1992), 104–126.

[30] Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. 2023. A survey on fairness in large language models. arXiv preprint arXiv:2308.10149 (2023).

[31] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, and Yang Liu. 2023. Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860 (2023).

[32] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786 (2021).

[33] Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, and Alexander D’Amour. 2024. Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation. arXiv preprint arXiv:2402.12649 (2024).

[34] Sorin Adam Matei and Caius Dobrescu. 2011. Wikipedia’s “neutral point of view”: Settling conflict through ambiguity. The Information Society 27, 1 (2011), 40–51.

[35] Lorna McGregor, Daragh Murray, and Vivian Ng. 2019. International human rights law as a framework for algorithmic accountability. International & Comparative Law Quarterly 68, 2 (2019), 309–343.

[36] Merriam-Webster. [n. d.]. Hedge. https://www.merriam-webster.com/dictionary/hedge

[37] Bertalan Meskó. 2023. Prompt engineering as an important emerging skill for medical professionals: tutorial. Journal of medical Internet research 25 (2023), e50638.

[38] PG Meyer. 1997. Hedging strategies in written academic discourse: Strengthening the argument by weakening the claim. Hedging and discourse: Approaches to the analysis of a pragmatic phenomenon in academic texts. Walter de Gruyter & Co (1997).

[39] Jared Moore, Tanvi Deshpande, and Diyi Yang. 2024. Are Large Language Models Consistent over Value-laden Questions? arXiv preprint arXiv:2407.02996 (2024).

[40] Bureau of Cyberspace and Digital Policy. [n. d.]. Risk Management Profile for Artificial Intelligence and Human Rights. https://www.state.gov/risk-management-profile-for-ai-and-human-rights/#fn4 Accessed: 01/10/2025

[41] OpenAI. 2024. Usage Policies. https://openai.com/policies/usage-policies/

[42] Vinodkumar Prabhakaran, Margaret Mitchell, Timnit Gebru, and Iason Gabriel. 2022. A human rights-based approach to responsible AI. arXiv preprint arXiv:2210.02667 (2022).

[44] David Quinn. 2020. Self-determination movements and their outcomes. In Peace and Conflict 2008. Routledge, 33–38.

[45] Tim Räz. 2021. Group fairness: Independence revisited. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 129–137.

[46] Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended abstracts of the 2021 CHI conference on human factors in computing systems. 1–7.

[47] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose opinions do language models reflect? In International Conference on Machine Learning. PMLR, 29971–30004.

[48] Nino Scherrer, Claudia Shi, Amir Feder, and David Blei. 2024. Evaluating the moral beliefs encoded in LLMs. Advances in Neural Information Processing Systems 36 (2024).

[49] Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin A Rothkopf, and Kristian Kersting. 2022. Large pre-trained language models contain human-like biases of what is right and wrong to do. Nature Machine Intelligence 4, 3 (2022), 258–268.

[50] Taylor Sorensen, Liwei Jiang, Jena D Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, et al. 2024. Value kaleidoscope: Engaging AI with pluralistic human values, rights, and duties. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19937–19947.

[52] Olivia Steiert. 2024. Declaring crisis? Temporal constructions of climate change on Wikipedia. Public Understanding of Science (2024), 09636625241268890.

[53] Fritz Strack and Leonard L Martin. 1987. Thinking, judging, and communicating: A process account of context effects in attitude surveys. In Social information processing and survey methodology. Springer, 123–148.

[54] Karel Vasak. 1977. A 30-year struggle; the sustained efforts to give force of law to the Universal Declaration of Human Rights. (1977).

[55] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.

[56] Gal Yona, Roee Aharoni, and Mor Geva. 2024. Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? arXiv preprint arXiv:2405.16908 (2024).

[57] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 335–340.

A Appendix

A.1 Selection of identiti...