What Do People Actually Want From AI? Mapping Preference Plurality

Julia Sepúlveda Coelho; Scott A. Hale

arXiv: 2606.06674 · v1 · pith:6MW5DLPQ · submitted 2026-06-04 · cs.CL · cs.CY

Pith reviewed 2026-06-28 01:29 UTC · model grok-4.3

classification: cs.CL · cs.CY

keywords: AI alignment · RLHF · preference plurality · truthfulness · human feedback · preference elicitation · epistemic violence

The pith

People's preferences for AI diverge sharply, with even 'truthfulness' carrying incompatible meanings across respondents that single reward models cannot capture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes 1,500 open-ended responses from the PRISM dataset, drawn from 75 countries, to identify what users actually want from AI systems. It finds that most requested values are named by fewer than a quarter of respondents; truthfulness, at 49 percent, is the sole exception. Shared terms hide divergent definitions: truthfulness means sourced claims to some respondents and expert opinion, or even unpopular views, to others. Capabilities such as human-like behavior and features such as guardrails prove controversial, wanted by some and rejected by others. People also draw contextual lines, such as default versus requested behaviors, that binary comparison methods miss.

Core claim

The central claim is that preference plurality and semantic divergence in open-ended responses expose fundamental limits of RLHF-based alignment. When nearly half of respondents request truthfulness but define it differently, and when their contextual distinctions go beyond what binary comparisons can express, aggregating feedback into a single reward model flattens situated signals and fails to match what users actually ask for. The paper points to persistently high hallucination rates, despite clear demands for accuracy, as a symptom of this mismatch.
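To make the aggregation argument concrete, here is a toy sketch, not the paper's analysis. It assumes the reward model is a Bradley-Terry model (the pairwise formulation commonly used for RLHF reward models) and fits the reward gap between two answers to the same question: answer A cites sources, answer B defers to expert opinion. The comparison counts and helper names are invented for illustration. Each group on its own yields a clear preference, while the pooled fit lands at zero, which looks exactly like indifference.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used by the Bradley-Terry choice probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_bt_margin(wins_a: int, wins_b: int, lr: float = 1.0, steps: int = 2000) -> float:
    """Fit d = r_A - r_B for a two-answer Bradley-Terry model by gradient ascent.

    P(A preferred over B) = sigmoid(d); we maximise the mean log-likelihood
    of the observed wins.
    """
    n = wins_a + wins_b
    d = 0.0
    for _ in range(steps):
        # Gradient of the mean log-likelihood with respect to d.
        grad = (wins_a - n * sigmoid(d)) / n
        d += lr * grad
    return d


# Hypothetical comparison counts (invented, not PRISM data).
# A = answer that cites sources, B = answer that defers to expert opinion.
group_sourced = (90, 10)  # this group prefers A in 90 of 100 comparisons
group_expert = (10, 90)   # this group prefers B in 90 of 100 comparisons
pooled = (group_sourced[0] + group_expert[0], group_sourced[1] + group_expert[1])

for name, (a, b) in [("sourced", group_sourced), ("expert", group_expert), ("pooled", pooled)]:
    print(f"{name:8s} r_A - r_B = {fit_bt_margin(a, b):+.3f}")
```

For two answers the fit has a closed form, r_A - r_B = log(wins_A / wins_B), so each group gets a margin of about ±2.197 while the pooled 100:100 split gives log 1 = 0. The gradient loop is there only to mirror how a reward model is trained.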

What carries the argument

Qualitative coding of open-ended responses from the PRISM dataset that surfaces preference plurality and incompatible epistemological bases behind shared terms.

If this is right

A single reward model is unlikely to satisfy the varied definitions of truthfulness.
Binary comparisons cannot encode distinctions between default and requested behaviors.
Controversial features like guardrails will produce both demand and rejection.
Persistent hallucinations suggest that current methods do not identify the accuracy preferences expressed in the data.
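A minimal sketch of why a context-free binary label loses the default-versus-requested distinction. It assumes, hypothetically, that each comparison of the same two responses is logged together with the context in which the rater judged it: response A volunteers an unpopular view, response B sticks to the mainstream view. The records are invented. Pooled, the pair looks like a coin flip; split by context, each preference is unanimous.

```python
from collections import Counter, defaultdict

# Hypothetical judgements for one fixed pair of responses (illustrative, not PRISM data).
# Each record is (context of the request, which response the rater preferred).
records = [
    ("default", "B"), ("default", "B"), ("default", "B"),
    ("if_requested", "A"), ("if_requested", "A"), ("if_requested", "A"),
]


def share_preferring_a(records: list[tuple[str, str]], keep_context: bool) -> dict[str, float]:
    """Share of comparisons won by A, either per context or pooled under one key."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for context, winner in records:
        key = context if keep_context else "all"
        counts[key][winner] += 1
    return {key: c["A"] / sum(c.values()) for key, c in counts.items()}


print(share_preferring_a(records, keep_context=False))  # {'all': 0.5}
print(share_preferring_a(records, keep_context=True))   # {'default': 0.0, 'if_requested': 1.0}
```

A plain pairwise label records only the winner, which corresponds to the `keep_context=False` view: the rule "B by default, A if requested" becomes indistinguishable from raters who do not care.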

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Alignment systems may require multiple or context-switching models rather than one universal preference function.
Future preference datasets would benefit from including open-ended questions alongside ratings to avoid flattening signals.
The findings connect to broader questions about how to handle value pluralism when building public AI systems.

Load-bearing premise

The qualitative reading of divergent meanings in the responses correctly identifies incompatible bases without substantial researcher bias or sampling effects.

What would settle it

A single reward model, trained only on binary comparisons, whose outputs satisfy the full range of truthfulness definitions given in the open-ended responses.

Original abstract

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows plurality in what people want from AI via open-ended PRISM responses, but the qualitative claims on meaning divergence rest on unvalidated coding.

Colleague,

The core observation is that 1,500 open-ended PRISM responses from 75 countries show that most requested values come from under a quarter of respondents, with truthfulness at 49% but carrying incompatible interpretations such as sourced claims versus expert opinion versus unpopular views. This is presented as evidence that single reward models cannot capture the spread.

The shift to open-ended data is the clearest difference from standard preference work. It surfaces that some capabilities are outright divisive and that users draw contextual lines binary comparisons would miss. That part is useful for anyone thinking about why alignment techniques still produce outputs that contradict stated user priorities.

The weak point is the interpretive layer. The claims about divergent epistemological bases depend on how the responses were coded and labeled as incompatible. The abstract supplies no numbers on coders, agreement rates, or steps to check for framing effects, and the stress-test note flags the same gap. Without those details it is difficult to separate real divergence from researcher reading. The sample diversity is a plus, but the lack of reported demographics or exclusion rules leaves the same uncertainty.

This is for alignment researchers who already suspect current aggregation methods lose signal. It does not deliver a new method or dataset that would change practice on its own.

It should go to peer review. The question is worth asking, but the methods section needs to be explicit enough for others to evaluate the coding and the leap from lexical variation to structural limits on reward models.

Referee Report

2 major / 2 minor

Summary. The paper analyzes 1,500 open-ended responses from the PRISM dataset across 75 countries to map user preferences for AI systems. It reports that preferences are highly diverse, with most values requested by fewer than 25% of respondents and truthfulness the only exception at 49%; however, open-ended elaborations on truthfulness reveal divergent epistemological bases (sourced claims vs. expert opinion vs. unpopular views). The work further identifies controversial features (e.g., human-likeness, guardrails) and contextual distinctions (default vs. requested behavior) that binary preference comparisons cannot capture, concluding that these pluralities cannot be aggregated into single reward models and that current RLHF practices therefore fail to identify actual preferences.

Significance. If the qualitative interpretations are robust, the findings provide concrete empirical grounding for known limitations of RLHF, showing that lexical agreement on values like truthfulness masks incompatible underlying demands. This could motivate development of alignment methods that handle preference plurality and context rather than forcing aggregation, and the use of a large multi-country open-ended dataset is a strength relative to typical binary-feedback studies.

major comments (2)

[Methods / Results (qualitative analysis of truthfulness responses)] The central claim that 49% of respondents request truthfulness but with incompatible epistemological bases (and thus cannot be captured by a single reward model) rests entirely on the authors' qualitative coding of open-ended responses. No details are supplied on the coding protocol, number of coders, inter-rater reliability (e.g., Cohen's kappa), coder training, or steps to mitigate researcher framing bias. This absence is load-bearing for the incompatibility interpretation and for the broader argument about epistemic violence in alignment.
[Section describing the PRISM dataset and sample] Concrete demographic and exclusion criteria are absent from the abstract, and the manuscript provides no information on sample demographics, response exclusion rules, or the representativeness of the 1,500 PRISM responses. These omissions undermine the generalizability claim that current methods 'fail to identify actual preferences' across populations.
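For reference, Cohen's kappa, the agreement statistic named in the first major comment, corrects two coders' raw agreement p_o for the agreement p_e expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below assumes each response receives exactly one code; multi-label coding, where one response can name several values, would need a per-code kappa or a different statistic. The codes and labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_1: list[str], coder_2: list[str]) -> float:
    """Cohen's kappa for two coders who each assign one code per item."""
    if len(coder_1) != len(coder_2) or not coder_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_1)
    # Observed agreement: share of items given the same code by both coders.
    p_o = sum(a == b for a, b in zip(coder_1, coder_2)) / n
    # Chance agreement: product of each code's marginal frequencies, summed over codes.
    freq_1, freq_2 = Counter(coder_1), Counter(coder_2)
    p_e = sum(freq_1[c] * freq_2[c] for c in freq_1.keys() | freq_2.keys()) / (n * n)
    if p_e == 1.0:
        # Both coders used one identical code throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical codes for six truthfulness responses (illustrative only).
coder_1 = ["sourced", "sourced", "expert", "unpopular", "expert", "sourced"]
coder_2 = ["sourced", "expert", "expert", "unpopular", "expert", "sourced"]
print(round(cohens_kappa(coder_1, coder_2), 3))  # → 0.739
```

Here the coders agree on 5 of 6 responses (p_o ≈ 0.833) and chance agreement is 13/36, so kappa = 17/23 ≈ 0.739: noticeably lower than raw agreement, which is why the statistic is usually reported instead of percent agreement.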

minor comments (2)

[Abstract] The abstract is dense and could be split for clarity; the phrase 'epistemic violence' is used without a direct citation to the source work that introduced the characterization.
[Results] A figure or table summarizing the distribution of requested values (beyond the 49% truthfulness figure) would help readers assess the claim that most values are requested by fewer than a quarter of respondents.
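A minimal sketch of the summary the second minor comment asks for. It assumes each respondent's answer is coded with a set of values, so shares are per respondent and need not sum to 100%. The codes below are hypothetical, not PRISM data; the 0.25 cutoff mirrors the "fewer than a quarter" claim.

```python
from collections import Counter


def value_shares(coded_responses: list[set[str]]) -> dict[str, float]:
    """Share of respondents whose answer was assigned each value code."""
    n = len(coded_responses)
    # Using sets means a respondent is counted at most once per value.
    counts = Counter(code for codes in coded_responses for code in codes)
    return {code: count / n for code, count in counts.items()}


# Hypothetical coded answers for five respondents (illustrative only).
coded = [
    {"truthfulness", "privacy"},
    {"truthfulness"},
    {"empathy"},
    {"truthfulness", "neutrality"},
    {"creativity"},
]

shares = value_shares(coded)
for code, share in sorted(shares.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{code:13s} {share:.0%}")

# Values named by fewer than a quarter of respondents.
minority_values = sorted(code for code, share in shares.items() if share < 0.25)
print(minority_values)
```

Sorting by descending share, then alphabetically, keeps the table stable regardless of set iteration order.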

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating revisions where appropriate.

Referee: [Methods / Results (qualitative analysis of truthfulness responses)] The central claim that 49% of respondents request truthfulness but with incompatible epistemological bases (and thus cannot be captured by a single reward model) rests entirely on the authors' qualitative coding of open-ended responses. No details are supplied on the coding protocol, number of coders, inter-rater reliability (e.g., Cohen's kappa), coder training, or steps to mitigate researcher framing bias. This absence is load-bearing for the incompatibility interpretation and for the broader argument about epistemic violence in alignment.

Authors: We agree that the manuscript lacks sufficient detail on the qualitative coding process, which is important for evaluating the robustness of the incompatibility claims. The coding was conducted iteratively by the lead authors through repeated review and discussion to identify and categorize the distinct epistemological bases (e.g., sourced claims, expert opinion, unpopular views). No formal inter-rater reliability statistic such as Cohen's kappa was computed, and coder training was informal. In revision, we will add a dedicated Methods subsection describing the full protocol, the number of coders, the consensus process, and any steps taken to reduce framing bias (such as using multiple independent readings before joint discussion). This addition will directly support the interpretation without altering the core findings. revision: yes
Referee: [Section describing the PRISM dataset and sample] Concrete demographic and exclusion criteria are absent from the abstract, and the manuscript provides no information on sample demographics, response exclusion rules, or the representativeness of the 1,500 PRISM responses. These omissions undermine the generalizability claim that current methods 'fail to identify actual preferences' across populations.

Authors: We acknowledge that the manuscript does not include a self-contained summary of the PRISM sample characteristics, exclusion rules, or representativeness, which limits assessment of generalizability. Although the PRISM dataset paper provides these details, the current work does not extract or discuss them. We will revise the manuscript to add a subsection on the dataset and sample, including available demographic breakdowns, any exclusion criteria applied to arrive at the 1,500 responses, and an explicit discussion of representativeness limitations. This will better ground the claims about preferences across populations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical survey with no derivation or fitting chain

The paper is a qualitative and quantitative analysis of 1,500 open-ended survey responses from the PRISM dataset. It reports frequencies (e.g., truthfulness requested by 49%), codes open responses for divergent meanings, and draws interpretive conclusions about alignment methods. No equations, parameters, predictions, or derivations are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained against external benchmarks (survey data) and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical qualitative study whose conclusions rest on the validity of the PRISM dataset and the authors' interpretations of free-text responses.

axioms (1)

domain assumption: The 1,500 open-ended responses in the PRISM dataset provide representative and unbiased signals of what people actually want from AI.
All claims about preference diversity and failures of current alignment methods depend on this premise.

pith-pipeline@v0.9.1-grok · 5834 in / 1138 out tokens · 35206 ms · 2026-06-28T01:29:40.037063+00:00 · methodology

Reference graph

Works this paper leans on

93 extracted references · 70 canonical work pages · 8 internal anchors

[1]

Shaffer, Patricia Kingori, Koen Peeters Grietens, James Muldoon, and Luc Rocher

Arsenii Alenichev, Jonathan D. Shaffer, Patricia Kingori, Koen Peeters Grietens, James Muldoon, and Luc Rocher. 2025. ‘We can see a savage’: a case study of the colonial gaze in generative AI algorithms.AI & SOCIETY(Nov. 2025). doi:10.1007/s00146-025-02685-0

work page doi:10.1007/s00146-025-02685-0 2025
[2]

Anthropic. [n. d.]. Alignment Research. https://www.anthropic.com/research/team/alignment
[3]

Anthropic. 2023. Claude’s Constitution. https://www.anthropic.com/news/claudes-constitution

2023
[4]

Lora Aroyo and Chris Welty. 2015. Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation.AI Magazine36, 1 (March 2015), 15–24. doi:10.1609/aimag.v36i1.2564

work page doi:10.1609/aimag.v36i1.2564 2015
[5]

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. 2021. A General Language Assistant as a ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.00861 2021
[6]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.08073 2022
[7]

Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. 2025. HalluLens: LLM Hallucination Benchmark. doi:10.48550/arXiv.2504.17550 arXiv:2504.17550 [cs]

work page doi:10.48550/arxiv.2504.17550 2025
[8]

Christoph Bartneck. 2023. Godspeed Questionnaire Series: Translations and Usage. InInternational Handbook of Behavioral Health Assessment. Springer, Cham, 1–35. doi:10.1007/978-3-030-89738-3_24-1

work page doi:10.1007/978-3-030-89738-3_24-1 2023
[9]

Robert Booth and Lisa O’Carroll. 2025. Meta found in breach of EU law over‘ineffective’complaints system for flagging illegal content. The Guardian(Oct. 2025). https://www.theguardian.com/technology/2025/oct/24/instagram-facebook-breach-eu-law-content-flagging

2025
[10]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology.Qualitative Research in Psychology3, 2 (Jan. 2006), 77–101. doi:10.1191/1478088706qp063oa _eprint: https://doi.org/10.1191/1478088706qp063oa

work page doi:10.1191/1478088706qp063oa 2006
[11]

Flavio Calvino, Daniel Haerle, and Sarah Liu. 2025. Is generative AI a General Purpose Technology?: Implications for productivity and policy.OECD Artificial Intelligence Papers(June 2025). doi:10.1787/704e2d12-en

work page doi:10.1787/704e2d12-en 2025
[12]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.15217 2023
[13]

2023.Imagining AI: How the World Sees Intelligent Machines

Stephen Cave and Kanta Dihal (Eds.). 2023.Imagining AI: How the World Sees Intelligent Machines. Oxford University Press, Oxford, New York

2023
[14]

John Chen, Alexandros Lotsos, Grace Wang, Lexie Zhao, Bruce Sherin, Uri Wilensky, and Michael Horn. 2025. Processes Matter: How ML/GAI Approaches Could Support Open Qualitative Coding of Online Discourse Datasets. doi:10.48550/arXiv.2504.02887 arXiv:2504.02887 [cs]

work page doi:10.48550/arxiv.2504.02887 2025
[15]

European Commission. 2025. Commission finds Apple and Meta in breach of the Digital Markets Act. https://ec.europa.eu/commission/ presscorner/detail/en/ip_25_1085

2025
[16]

European Commission. 2025. Commission fines X€120 million under the Digital Services Act | Shaping Europe ’s digital future. https://digital-strategy.ec.europa.eu/en/news/commission-fines-x-eu120-million-under-digital-services-act

2025
[17]

Holliday, Bob M

Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mosse, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, and William S. Zwicker. 2024. Position: Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback. InProceedings of the 41st International Conference on Mach...

2024
[18]

2008.Basics of Qualitative Research (3rd ed.): Techniques and Procedures for Developing Grounded Theory

Juliet Corbin and Anselm Strauss. 2008.Basics of Qualitative Research (3rd ed.): Techniques and Procedures for Developing Grounded Theory. SAGE Publications, Inc. doi:10.4135/9781452230153

work page doi:10.4135/9781452230153 2008
[19]

Stefano De Paoli. 2024. Performing an Inductive Thematic Analysis of Semi-Structured Interviews With a Large Language Model: An Exploration and Provocation on the Limits of the Approach.Social Science Computer Review42, 4 (Aug. 2024), 997–1019. doi:10.1177/ 08944393231220483

2024
[20]

Dietvorst, Joseph P

Berkeley J. Dietvorst, Joseph P. Simmons, and Cade Massey. 2015. Algorithm aversion: People erroneously avoid algorithms after seeing them err.Journal of Experimental Psychology: General144, 1 (2015), 114–126. doi:10.1037/xge0000033 What Do People Actually Want From AI? Mapping Preference Plurality FAccT ’26, June 25–28, 2026, Montreal, QC, Canada

work page doi:10.1037/xge0000033 2015
[21]

Earp, Killian L

Brian D. Earp, Killian L. McLoughlin, Joshua T. Monrad, Margaret S. Clark, and Molly J. Crockett. 2021. How social relationships shape moral wrongness judgments.Nature Communications12, 1 (Oct. 2021), 5776. doi:10.1038/s41467-021-26067-4

work page doi:10.1038/s41467-021-26067-4 2021
[22]

Hans Esselborn. 2023. German Science Fiction Literature Exploring AI: Expectations, Hopes, and Fears. InImagining AI: How the World Sees Intelligent Machines, Stephen Cave and Kanta Dihal (Eds.). Oxford University Press, 0. doi:10.1093/oso/9780192865366.003.0005

work page doi:10.1093/oso/9780192865366.003.0005 2023
[23]

Facebook. 2015. Facebook’s 5 Core Values. https://www.facebook.com/media/set/?set=a.1655178611435493.1073741828. 1633466236940064

arXiv 2015
[24]

Michael Feffer, Michael Skirpan, Zachary Lipton, and Hoda Heidari. 2023. From Preference Elicitation to Participatory ML: A Critical Survey & Guidelines for Future Research. InProceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’23). Association for Computing Machinery, New York, NY, USA, 38–48. doi:10.1145/3600211.3604661

work page doi:10.1145/3600211.3604661 2023
[25]

Iason Gabriel. 2020. Artificial Intelligence, Values, and Alignment.Minds and Machines30, 3 (Sept. 2020), 411–437. doi:10.1007/s11023- 020-09539-2

work page doi:10.1007/s11023- 2020
[26]

Iason Gabriel and Geoff Keeling. 2025. A matter of principle? AI alignment as the fair treatment of claims.Philosophical Studies182, 7 (July 2025), 1951–1973. doi:10.1007/s11098-025-02300-4

work page doi:10.1007/s11098-025-02300-4 2025
[27]

Eva Johanna Gengler. 2024. Sexism, Racism, and Classism: Social Biases in Text-to-Image Generative AI in the Context of Power, Success, and Beauty.Wirtschaftsinformatik 2024 Proceedings(Jan. 2024). https://aisel.aisnet.org/wi2024/48

2024
[28]

2023.Trust in Artificial Intelligence: A global study

Nicole Gillespie, Steven Lockey, Caitlin Curtis, Javad Pool, and Ali Akbari. 2023.Trust in Artificial Intelligence: A global study. Technical Report. The University of Queensland; KPMG Australia, Brisbane, Australia. doi:10.14264/00d3c94

work page doi:10.14264/00d3c94 2023
[29]

Google. [n. d.]. Our approach - how Google Search works. https://www.google.com/intl/en_uk/search/howsearchworks/our-approach
[30]

Xin Han, Marten H. L. Kaas, and Cuizhu Dawn Wang. 2025. A Cross-Cultural Examination of Fairness Beliefs in Human-AI Interaction. doi:10.2139/ssrn.5116823

work page doi:10.2139/ssrn.5116823 2025
[31]

William Hobbs and Jon Green. 2025. Categorizing Topics Versus Inferring Attitudes: A Theory and Method for Analyzing Open-ended Survey Responses.Political Analysis33, 3 (July 2025), 231–251. doi:10.1017/pan.2024.23

work page doi:10.1017/pan.2024.23 2025
[32]

Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024. AI generates covertly racist decisions about people based on their dialect.Nature633, 8028 (Sept. 2024), 147–154. doi:10.1038/s41586-024-07856-5

work page doi:10.1038/s41586-024-07856-5 2024
[33]

Michel Hohendanner, Chiara Ullstein, Bukola Abimbola Onyekwelu, Amelia Katirai, Jun Kuribayashi, Olusola Babalola, Arisa Ema, and Jens Grossklags. 2025. Initiating the Global AI Dialogues: Laypeople Perspectives on the Future Role of genAI in Society from Nigeria, Germany and Japan. InProceedings of the 2025 CHI Conference on Human Factors in Computing Sy...

work page doi:10.1145/3706598.3714322 2025
[34]

Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli

Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. 2024. Collective Constitutional AI: Aligning a Language Model with Public Input. InThe 2024 ACM Conference on Fairness, Accountability, and Transparency. 1395–1417. doi:10.1145/3630106.3658979 arXiv:2406.07814 [cs]

work page doi:10.1145/3630106.3658979 2024
[35]

Simon Hughes, Minseok Bae, and Miaoran Li. 2023. Vectara Hallucination Leaderboard. https://github.com/vectara/hallucination- leaderboard original-date: 2023-10-31T21:19:12Z

2023
[36]

Lujain Ibrahim, Franziska Sofia Hafner, and Luc Rocher. 2025. Training language models to be warm and empathetic makes them less reliable and more sycophantic. doi:10.48550/arXiv.2507.21919 arXiv:2507.21919 [cs] version: 2

work page doi:10.48550/arxiv.2507.21919 2025
[37]

Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Lukas Vierling, Donghai Hong, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Juntao Dai, Xuehai Pan, Kwan Yee Ng, Aidan O’Gara, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. 2025. AI Alignment: A Compr...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.19852 2025
[38]

Bisantz, and Colin G

Jiun-Yin Jian, Ann M. Bisantz, and Colin G. Drury. 2000. Foundations for an Empirically Determined Scale of Trust in Auto- mated Systems.International Journal of Cognitive Ergonomics4, 1 (March 2000), 53–71. doi:10.1207/S15327566IJCE0401_04 _eprint: https://doi.org/10.1207/S15327566IJCE0401_04

work page doi:10.1207/s15327566ijce0401_04 2000
[39]

Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, and Min Zhang. 2025. A Survey on Human Preference Learning for Aligning Large Language Models.ACM Comput. Surv.58, 6 (Dec. 2025), 152:1–152:39. doi:10.1145/3773279

work page doi:10.1145/3773279 2025
[40]

Courtney Johnson and Alec Tyson. 2020. Are AI and job automation good for society? Globally, views are mixed. https://www.pewresearch.org/short-reads/2020/12/15/people-globally-offer-mixed-views-of-the-impact-of-artificial-intelligence- job-automation-on-society/

2020
[41]

Ekaterina Jussupow, Izak Benbasat, and Armin Heinzl. 2020. Why Are We Averse Towards Algorithms? A Comprehensive Literature Review on Algorithm Aversion.ECIS 2020 Research Papers(June 2020). https://aisel.aisnet.org/ecis2020_rp/168

2020
[42]

Dongjun Kang, Joonsuk Park, Yohan Jo, and JinYeong Bak. 2023. From Values to Opinions: Predicting Human Behaviors and Stances Using Value-Injected Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Sing...

work page doi:10.18653/v1/ 2023
[43]

Shivani Kapania, Oliver Siy, Gabe Clapper, Azhagu Meena SP, and Nithya Sambasivan. 2022. ”Because AI is 100% right and safe”: User Attitudes and Sources of AI Authority in India. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems FAccT ’26, June 25–28, 2026, Montreal, QC, Canada Julia Sepúlveda Coelho and Scott A. Hale (CHI ’22...

work page doi:10.1145/3491102.3517533 2022
[44]

Platforms & Society , author =

Francisco W. Kerche, Matthew Zook, and Mark Graham. 2026. The silicon gaze: A typology of biases and inequality in LLMs through the lens of place.Platforms & Society3 (March 2026), 29768624251408919. doi:10.1177/29768624251408919

work page doi:10.1177/29768624251408919 2026
[45]

ISBN 9798400710940

Khyati Khandelwal, Manuel Tonneau, Andrew M. Bean, Hannah Rose Kirk, and Scott A. Hale. 2024. Indian-BhED: A Dataset for Measuring India-Centric Biases in Large Language Models. InProceedings of the 2024 International Conference on Information Technology for Social Good. 231–239. doi:10.1145/3677525.3678666 arXiv:2309.08573 [cs]

work page doi:10.1145/3677525.3678666 2024
[46]

Johannes Kiesel, Milad Alshomary, Nicolas Handke, Xiaoni Cai, Henning Wachsmuth, and Benno Stein. 2022. Identifying the Human Values behind Arguments. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computatio...

work page doi:10.18653/v1/2022.acl-long.306 2022
[47]

Bean, Bertie Vidgen, Paul Röttger, and Scott A

Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, and Scott A. Hale. 2023. The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). As...

work page doi:10.18653/v1/2023.emnlp-main.148 2023
[48]

Alignment

Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A. Hale. 2023. The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models. doi:10.48550/arXiv.2310.02457 arXiv:2310.02457 [cs]

work page doi:10.48550/arxiv.2310.02457 2023
[49]

Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A. Hale. 2024. The benefits, risks and bounds of personalizing the alignment of large language models to individuals.Nature Machine Intelligence6, 4 (April 2024), 383–392. doi:10.1038/s42256-024-00820-y

work page doi:10.1038/s42256-024-00820-y 2024
[50]

Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large La...

arXiv 2024
[51]

Hadas Kotek, Rikker Dockum, and David Sun. 2023. Gender bias and stereotypes in Large Language Models. InProceedings of The ACM Collective Intelligence Conference (CI ’23). Association for Computing Machinery, New York, NY, USA, 12–24. doi:10.1145/3582269.3615599

work page doi:10.1145/3582269.3615599 2023
[52]

Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, and Mateusz Maria Jurewicz. 2025. DarkBench: Benchmarking Dark Patterns in Large Language Models. doi:10.48550/arXiv.2503.10728 arXiv:2503.10728 [cs]

work page doi:10.48550/arxiv.2503.10728 2025
[53]

Smith, and Hannaneh Hajishirzi

Sachin Kumar, Chan Young Park, Yulia Tsvetkov, Noah A. Smith, and Hannaneh Hajishirzi. 2024. ComPO: Community Preferences for Language Model Personalization. (2024). doi:10.48550/ARXIV.2410.16027 Version Number: 1

work page doi:10.48550/arxiv.2410.16027 2024
[54]

Cherie Lacey and Catherine Caudwell. 2019. Cuteness as a ‘Dark Pattern’ in Home Robots. In2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). 374–381. doi:10.1109/HRI.2019.8673274 ISSN: 2167-2148

work page doi:10.1109/hri.2019.8673274 2019
[55]

Nathan Lambert and Roberto Calandra. 2024. The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback. doi:10.48550/arXiv.2311.00168 arXiv:2311.00168 [cs]

work page doi:10.48550/arxiv.2311.00168 2024
[56]

2022.Cloud Empires: How Digital Platforms Are Overtaking the State and How We Can Regain Control

Vili Lehdonvirta. 2022.Cloud Empires: How Digital Platforms Are Overtaking the State and How We Can Regain Control. The MIT Press. doi:10.7551/mitpress/14219.001.0001

work page doi:10.7551/mitpress/14219.001.0001 2022
[57]

Yuyun Li. 2024. Regulatory disputes between Brazil and X | Feature from King’s College London. https://www.kcl.ac.uk/regulatory- disputes-between-brazil-and-x

2024
[58]

Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, and Roel Dobbe. 2024. AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations. (2024). doi:10.48550/ARXIV.2406.18346 Version Number: 1

[59]

Philipp Lorenz-Spreen, Lisa Oswald, Stephan Lewandowsky, and Ralph Hertwig. 2022. A systematic review of worldwide causal and correlational evidence on digital media and democracy. Nature Human Behaviour (Nov. 2022), 1–28. doi:10.1038/s41562-022-01460-1

[60]

Beier Luo, Shuoyuan Wang, Sharon Li, and Hongxin Wei. 2025. Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator. doi:10.48550/arXiv.2505.16690 arXiv:2505.16690 [cs]

[61]

Amy I. Nathanson, Elizabeth M. Perse, and Douglas A. Ferguson. 1997. Gender differences in television use: An exploration of the instrumental-expressive dichotomy. Communication Research Reports 14, 2 (March 1997), 176–188. doi:10.1080/08824099709388659

[62]

Safiya Umoja Noble. 2018. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press. doi:10.2307/j.ctt1pwt9w5

[63]

Office of Public Affairs. 2025. Department of Justice Prevails in Landmark Antitrust Case Against Google. https://www.justice.gov/opa/pr/department-justice-prevails-landmark-antitrust-case-against-google

[64]

OpenAI. 2024. Evaluating fairness in ChatGPT. https://openai.com/index/evaluating-fairness-in-chatgpt/

[65]

OpenAI. 2024. Our approach to alignment research. https://openai.com/index/our-approach-to-alignment-research/

[66]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human fee...

[67]

Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, and He He. 2024. Beyond the Binary: Capturing Diverse Preferences With Reward Regularization. doi:10.48550/arXiv.2412.03822 arXiv:2412.03822 [cs]

[68]

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kern...

[69]

Uwe Peters and Benjamin Chin-Yee. 2025. Generalization bias in large language model summarization of scientific research. Royal Society Open Science 12, 4 (April 2025), 241776. doi:10.1098/rsos.241776

[70]

Steve Randerson, Thomas Graydon-Guy, En-Yi Lin, and Sally Casswell. 2025. Exploring the Use of a Large Language Model for Inductive Content Analysis in a Discourse Network Analysis Study. Social Science Computer Review (March 2025), 08944393251326175. doi:10.1177/08944393251326175

[71]

Varun Nagaraj Rao, Eesha Agarwal, Samantha Dalal, Dan Calacci, and Andrés Monroy-Hernández. 2025. QuaLLM: An LLM-based Framework to Extract Quantitative Insights from Online Forums. doi:10.48550/arXiv.2405.05345 arXiv:2405.05345 [cs]

[72]

Johnny Saldaña. 2013. The Coding Manual for Qualitative Researchers (3 ed.). SAGE Publications. https://uk.sagepub.com/en-gb/eur/the-coding-manual-for-qualitative-researchers/book287917

[73]

Shalom Schwartz. 2012. An Overview of the Schwartz Theory of Basic Values. Online Readings in Psychology and Culture 2, 1 (Dec. 2012). doi:10.9707/2307-0919.1116

[74]

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. 2025. Towards Understanding Sycophancy in Language Models. doi:1...

[75]

Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, Sushrita Rakshit, Chenglei Si, Yutong Xie, Jeffrey P. Bigham, Frank Bentley, Joyce Chai, Zachary Lipton, Qiaozhu Mei, Rada Mihalcea, Michael Terry, Diyi Yang, Meredith Ringel Morris, Paul Resnick, and David Jurgens. 2025. ...

[76]

Hua Shen, Tiffany Knearem, Reshmi Ghosh, Yu-Ju Yang, Nicholas Clark, Tanushree Mitra, and Yun Huang. 2025. ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs. doi:10.48550/arXiv.2409.09586 arXiv:2409.09586 [cs]

[77]

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. AI models collapse when trained on recursively generated data. Nature 631, 8022 (July 2024), 755–759. doi:10.1038/s41586-024-07566-y

[78]

Mona Sloane. 2024. Controversies, contradiction, and “participation” in AI. Big Data & Society 11, 1 (March 2024), 20539517241235862. doi:10.1177/20539517241235862

[79]

Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi. 2024. A Roadmap to Pluralistic Alignment. doi:10.48550/arXiv.2402.05070 arXiv:2402.05070 [cs]

[80]

Fengfei Sun, Ningke Li, Kailong Wang, and Lorenz Goette. 2025. Large Language Models are overconfident and amplify human bias. doi:10.48550/arXiv.2505.02151 arXiv:2505.02151 [cs]
