
arxiv: 2606.06674 · v1 · pith:6MW5DLPQnew · submitted 2026-06-04 · 💻 cs.CL · cs.CY

What Do People Actually Want From AI? Mapping Preference Plurality

Pith reviewed 2026-06-28 01:29 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords AI alignment · RLHF · preference plurality · truthfulness · human feedback · preference elicitation · epistemic violence

The pith

People's preferences for AI diverge sharply, with even 'truthfulness' carrying incompatible meanings across respondents that single reward models cannot capture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes 1,500 open-ended responses from the PRISM dataset across 75 countries to identify what users actually want from AI systems. It finds that most values are each requested by fewer than a quarter of respondents; truthfulness, at 49 percent, is the sole exception. Shared terms also hide divergent definitions: truthfulness means sourced claims for some respondents, expert opinions for others, and unpopular views for others still. Capabilities such as human-like behavior and features such as guardrails prove controversial, with some respondents wanting them and others rejecting them. People also draw contextual lines, such as default versus requested behaviors, that binary comparison methods miss.

Core claim

The central claim is that preference plurality and semantic divergence in open-ended responses expose fundamental limits in RLHF alignment. When nearly half of respondents request truthfulness but define it differently, and when contextual distinctions exceed what binary comparisons can express, aggregating feedback into a single reward model flattens situated signals and fails to match actual user demands. The paper reads persistently high hallucination rates, despite clear preferences for accuracy, as a symptom of this failure.
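The flattening effect can be illustrated with a toy calculation. The sketch below is not the paper's analysis. It assumes a two-response Bradley-Terry model, for which the fitted reward gap equals the logit of the pooled win rate, and it uses hypothetical group shares and preference probabilities.

```python
import math


def pooled_bt_gap(share_a: float, p_a: float, p_b: float) -> float:
    """Reward gap r(x) - r(y) that one Bradley-Terry model fits to pooled data.

    share_a: fraction of comparisons contributed by group A (hypothetical).
    p_a, p_b: probability that group A / group B prefers response x over y.
    """
    pooled = share_a * p_a + (1 - share_a) * p_b
    # For two items, the Bradley-Terry MLE of the gap is the logit of the win rate.
    return math.log(pooled / (1 - pooled))


# Hypothetical groups: A reads "truthful" as sourced claims (prefers x),
# B reads it as willingness to state unpopular views (prefers y).
gap_group_a = pooled_bt_gap(1.0, 0.9, 0.1)  # group A alone
gap_group_b = pooled_bt_gap(0.0, 0.9, 0.1)  # group B alone
gap_pooled = pooled_bt_gap(0.5, 0.9, 0.1)   # equal mix of both groups
```

Under these assumed numbers each group on its own yields a strong reward gap, with opposite signs, while the pooled gap is zero: the single model reports indifference that no respondent actually holds.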

What carries the argument

Qualitative coding of open-ended responses from the PRISM dataset that surfaces preference plurality and incompatible epistemological bases behind shared terms.

If this is right

  • A single reward model is unlikely to satisfy the varied definitions of truthfulness.
  • Binary comparisons cannot encode distinctions between default and requested behaviors.
  • Controversial features like guardrails will produce both demand and rejection.
  • Persistent hallucinations indicate that current methods do not identify the accuracy preferences expressed in the data.
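The point about default versus requested behaviors can be made concrete with a small data-structure sketch. The record type, field names, and example labels below are hypothetical and are not the PRISM schema. The sketch shows how two context-dependent judgments turn into contradictory labels once a plain binary comparison drops the context.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    context: str   # hypothetical labels: "default" or "requested"
    chosen: str    # response style the respondent preferred
    rejected: str  # response style the respondent turned down


def conflicting_pairs(judgments):
    """Return response pairs that get opposite winners once context is ignored."""
    winners = defaultdict(set)
    for j in judgments:
        pair = frozenset((j.chosen, j.rejected))
        winners[pair].add(j.chosen)
    return [set(pair) for pair, won in winners.items() if len(won) > 1]


# One respondent: plain style by default, human-like style only when asked for.
judgments = [
    Judgment("default", chosen="plain", rejected="humanlike"),
    Judgment("requested", chosen="humanlike", rejected="plain"),
]
conflicts = conflicting_pairs(judgments)
```

Here `conflicts` contains the single pair of styles: the respondent is consistent when the context field is kept, but a context-free comparison dataset records the same pair with both winners.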

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment systems may require multiple or context-switching models rather than one universal preference function.
  • Future preference datasets would benefit from including open-ended questions alongside ratings to avoid flattening signals.
  • The findings connect to broader questions about how to handle value pluralism when building public AI systems.

Load-bearing premise

The qualitative reading of divergent meanings in the responses correctly identifies incompatible bases without substantial researcher bias or sampling effects.

What would settle it

The claim would be undercut by a single reward model, trained only on binary comparisons, whose outputs match the full range of definitions of truthfulness given in the open responses.

Original abstract

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes 1,500 open-ended responses from the PRISM dataset across 75 countries to map user preferences for AI systems. It reports that preferences are highly diverse: most values are requested by fewer than 25% of respondents, with truthfulness the only exception at 49%. Open-ended elaborations on truthfulness nevertheless reveal divergent epistemological bases (sourced claims vs. expert opinion vs. unpopular views). The work further identifies controversial features (e.g., human-likeness, guardrails) and contextual distinctions (default vs. requested behavior) that binary preference comparisons cannot capture. It concludes that such plurality is unlikely to be captured by a single reward model and that current RLHF practices therefore fail to identify actual preferences.

Significance. If the qualitative interpretations are robust, the findings provide concrete empirical grounding for known limitations of RLHF, showing that lexical agreement on values like truthfulness masks incompatible underlying demands. This could motivate development of alignment methods that handle preference plurality and context rather than forcing aggregation, and the use of a large multi-country open-ended dataset is a strength relative to typical binary-feedback studies.

major comments (2)
  1. [Methods / Results (qualitative analysis of truthfulness responses)] The central claim that 49% of respondents request truthfulness but with incompatible epistemological bases (and thus cannot be captured by a single reward model) rests entirely on the authors' qualitative coding of open-ended responses. No details are supplied on the coding protocol, number of coders, inter-rater reliability (e.g., Cohen's kappa), coder training, or steps to mitigate researcher framing bias. This absence is load-bearing for the incompatibility interpretation and for the broader argument about epistemic violence in alignment.
  2. [Section describing the PRISM dataset and sample] Concrete demographic and exclusion criteria are absent from the abstract, and the manuscript provides no information on sample demographics, response exclusion rules, or representativeness of the 1,500 PRISM responses. These omissions undermine the generalizability claim that current methods 'fail to identify actual preferences' across populations.
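For readers unfamiliar with the reliability statistic named in major comment 1, a minimal sketch of Cohen's kappa for two coders follows. The category labels and the six example codings are hypothetical; nothing here comes from the paper's own coding data.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codings of six "truthfulness" responses by two coders.
coder_a = ["sourced", "sourced", "expert", "unpopular", "expert", "sourced"]
coder_b = ["sourced", "expert", "expert", "unpopular", "expert", "sourced"]
kappa = cohens_kappa(coder_a, coder_b)
```

In this toy case the coders agree on five of six items, and chance agreement is 13/36, so kappa works out to 17/23, roughly 0.74.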
minor comments (2)
  1. [Abstract] The abstract is dense and could be split for clarity; the phrase 'epistemic violence' is used without a direct citation to the source work that introduced the characterization.
  2. [Results] A figure or table summarizing the distribution of requested values (beyond the 49% truthfulness figure) would help readers assess the 'most values requested by fewer than a quarter' claim.
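The summary table requested in minor comment 2 could be produced from coded responses along the following lines. This is a sketch under assumptions: each response is represented as a set of value codes, and the value names and five example responses are hypothetical.

```python
from collections import Counter


def value_shares(coded_responses):
    """Share of respondents whose response was coded with each value."""
    n = len(coded_responses)
    # Each respondent counts at most once per value.
    counts = Counter(value for codes in coded_responses for value in set(codes))
    return {value: count / n for value, count in counts.most_common()}


# Hypothetical multi-label codings, one set of value codes per respondent.
coded = [
    {"truthfulness", "helpfulness"},
    {"truthfulness"},
    {"privacy"},
    {"helpfulness", "guardrails"},
    {"truthfulness"},
]
shares = value_shares(coded)
below_quarter = sorted(v for v, s in shares.items() if s < 0.25)
```

Reporting `shares` for every coded value, rather than only the headline figure, would let readers check directly how many values fall below the one-quarter threshold.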

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating revisions where appropriate.

Point-by-point responses
  1. Referee: [Methods / Results (qualitative analysis of truthfulness responses)] The central claim that 49% of respondents request truthfulness but with incompatible epistemological bases (and thus cannot be captured by a single reward model) rests entirely on the authors' qualitative coding of open-ended responses. No details are supplied on the coding protocol, number of coders, inter-rater reliability (e.g., Cohen's kappa), coder training, or steps to mitigate researcher framing bias. This absence is load-bearing for the incompatibility interpretation and for the broader argument about epistemic violence in alignment.

    Authors: We agree that the manuscript lacks sufficient detail on the qualitative coding process, which is important for evaluating the robustness of the incompatibility claims. The coding was conducted iteratively by the lead authors through repeated review and discussion to identify and categorize the distinct epistemological bases (e.g., sourced claims, expert opinion, unpopular views). No formal inter-rater reliability statistic such as Cohen's kappa was computed, and coder training was informal. In revision, we will add a dedicated Methods subsection describing the full protocol, the number of coders, the consensus process, and any steps taken to reduce framing bias (such as using multiple independent readings before joint discussion). This addition will directly support the interpretation without altering the core findings. revision: yes

  2. Referee: [Section describing the PRISM dataset and sample] Concrete demographic and exclusion criteria are absent from the abstract, and the manuscript provides no information on sample demographics, response exclusion rules, or representativeness of the 1,500 PRISM responses. These omissions undermine the generalizability claim that current methods 'fail to identify actual preferences' across populations.

    Authors: We acknowledge that the manuscript does not include a self-contained summary of the PRISM sample characteristics, exclusion rules, or representativeness, which limits assessment of generalizability. Although the PRISM dataset paper provides these details, the current work does not extract or discuss them. We will revise the manuscript to add a subsection on the dataset and sample, including available demographic breakdowns, any exclusion criteria applied to arrive at the 1,500 responses, and an explicit discussion of representativeness limitations. This will better ground the claims about preferences across populations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical survey with no derivation or fitting chain

Full rationale

The paper is a qualitative and quantitative analysis of 1,500 open-ended survey responses from the PRISM dataset. It reports frequencies (e.g., truthfulness requested by 49%), codes open responses for divergent meanings, and draws interpretive conclusions about alignment methods. No equations, parameters, predictions, or derivations are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained against external benchmarks (survey data) and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical qualitative study whose conclusions rest on the validity of the PRISM dataset and the authors' interpretations of free-text responses.

axioms (1)
  • domain assumption: The 1,500 open-ended responses in the PRISM dataset provide representative and unbiased signals of what people actually want from AI.
    All claims about preference diversity and failures of current alignment methods depend on this premise.

pith-pipeline@v0.9.1-grok · 5834 in / 1138 out tokens · 35206 ms · 2026-06-28T01:29:40.037063+00:00 · methodology


Paper running header: What Do People Actually Want From AI? Mapping Preference Plurality · FAccT ’26, June 25–28, 2026, Montreal, QC, Canada · Julia Sepúlveda Coelho and Scott A. Hale