What Do People Actually Want From AI? Mapping Preference Plurality
Pith reviewed 2026-06-28 01:29 UTC · model grok-4.3
The pith
People's preferences for AI diverge sharply, with even 'truthfulness' carrying incompatible meanings across respondents that single reward models cannot capture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that preference plurality and semantic divergence in open-ended responses expose fundamental limits in RLHF alignment. When nearly half of respondents request truthfulness but define it differently, and when people draw contextual distinctions that binary comparisons cannot express, aggregating feedback into a single reward model flattens situated signals and fails to match what users actually ask for. The paper takes persistently high hallucination rates, despite clear demands for accuracy, as suggestive evidence of this mismatch.
What carries the argument
Qualitative coding of open-ended responses from the PRISM dataset that surfaces preference plurality and incompatible epistemological bases behind shared terms.
If this is right
- A single reward model is unlikely to satisfy the varied definitions of truthfulness.
- Binary comparisons cannot encode distinctions between default and requested behaviors.
- Controversial features like guardrails will produce both demand and rejection.
- Persistent hallucinations indicate that current methods do not identify the accuracy preferences expressed in the data.
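The first bullet can be illustrated with a toy calculation. The sketch below assumes a two-response Bradley-Terry model, a common choice for reward modelling that the paper itself does not specify, and uses invented comparison counts for two hypothetical respondent groups. The function name and all numbers are illustrative and are not taken from the PRISM data.

```python
import math

def bradley_terry_gap(wins_a: int, wins_b: int) -> float:
    """Maximum-likelihood reward gap r_A - r_B for a two-response
    Bradley-Terry model, given how often each response was preferred."""
    if wins_a <= 0 or wins_b <= 0:
        raise ValueError("both responses need at least one win for a finite gap")
    # For two items the fitted preference probability equals the empirical
    # win rate, so the gap is the log-odds of that rate.
    return math.log(wins_a / wins_b)

# Hypothetical counts: response A cites sources, response B offers an unpopular view.
sourced_group = (450, 50)     # (times A preferred, times B preferred)
unpopular_group = (50, 450)

gap_sourced = bradley_terry_gap(*sourced_group)
gap_unpopular = bradley_terry_gap(*unpopular_group)

# A single reward model only sees the pooled comparisons.
pooled_a = sourced_group[0] + unpopular_group[0]
pooled_b = sourced_group[1] + unpopular_group[1]
gap_pooled = bradley_terry_gap(pooled_a, pooled_b)

print(f"sourced-claims group gap:  {gap_sourced:+.2f}")
print(f"unpopular-views group gap: {gap_unpopular:+.2f}")
print(f"pooled single-model gap:   {gap_pooled:+.2f}")
```

Under these assumed counts each group holds a strong preference, yet the pooled model is indifferent between the two responses, which is the kind of flattening the bullet describes.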
Where Pith is reading between the lines
- Alignment systems may require multiple or context-switching models rather than one universal preference function.
- Future preference datasets would benefit from including open-ended questions alongside ratings to avoid flattening signals.
- The findings connect to broader questions about how to handle value pluralism when building public AI systems.
Load-bearing premise
The qualitative reading of divergent meanings in the responses correctly identifies incompatible bases without substantial researcher bias or sampling effects.
What would settle it
A single reward model trained only on binary comparisons that produces outputs matching the full range of definitions for truthfulness given in the open responses.
Original abstract
Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes 1,500 open-ended responses from the PRISM dataset across 75 countries to map user preferences for AI systems. It reports that preferences are highly diverse, with most values requested by fewer than 25% of respondents and truthfulness the only exception at 49%; however, open-ended elaborations on truthfulness reveal divergent epistemological bases (sourced claims vs. expert opinion vs. unpopular views). The work further identifies controversial features (e.g., human-likeness, guardrails) and contextual distinctions (default vs. requested behavior) that binary preference comparisons cannot capture, concluding that these pluralities cannot be aggregated into single reward models and that current RLHF practices therefore fail to identify actual preferences.
Significance. If the qualitative interpretations are robust, the findings provide concrete empirical grounding for known limitations of RLHF, showing that lexical agreement on values like truthfulness masks incompatible underlying demands. This could motivate development of alignment methods that handle preference plurality and context rather than forcing aggregation, and the use of a large multi-country open-ended dataset is a strength relative to typical binary-feedback studies.
major comments (2)
- [Methods / Results (qualitative analysis of truthfulness responses)] The central claim that 49% of respondents request truthfulness but with incompatible epistemological bases (and thus cannot be captured by a single reward model) rests entirely on the authors' qualitative coding of open-ended responses. No details are supplied on the coding protocol, number of coders, inter-rater reliability (e.g., Cohen's kappa), coder training, or steps to mitigate researcher framing bias. This absence is load-bearing for the incompatibility interpretation and for the broader argument about epistemic violence in alignment.
- [Section describing the PRISM dataset and sample] Concrete demographic and exclusion criteria are absent from the abstract, and the manuscript provides no information on sample demographics, response exclusion rules, or representativeness of the 1,500 PRISM responses. These omissions undermine the generalizability claim that current methods 'fail to identify actual preferences' across populations.
minor comments (2)
- [Abstract] The abstract is dense and could be split for clarity; the phrase 'epistemic violence' is used without a direct citation to the source work that introduced the characterization.
- [Results] A figure or table summarizing the distribution of requested values (beyond the 49% truthfulness figure) would help readers assess the 'most values requested by fewer than a quarter' claim.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating revisions where appropriate.
- Referee: [Methods / Results (qualitative analysis of truthfulness responses)] The central claim that 49% of respondents request truthfulness but with incompatible epistemological bases (and thus cannot be captured by a single reward model) rests entirely on the authors' qualitative coding of open-ended responses. No details are supplied on the coding protocol, number of coders, inter-rater reliability (e.g., Cohen's kappa), coder training, or steps to mitigate researcher framing bias. This absence is load-bearing for the incompatibility interpretation and for the broader argument about epistemic violence in alignment.
Authors: We agree that the manuscript lacks sufficient detail on the qualitative coding process, which is important for evaluating the robustness of the incompatibility claims. The coding was conducted iteratively by the lead authors through repeated review and discussion to identify and categorize the distinct epistemological bases (e.g., sourced claims, expert opinion, unpopular views). No formal inter-rater reliability statistic such as Cohen's kappa was computed, and coder training was informal. In revision, we will add a dedicated Methods subsection describing the full protocol, the number of coders, the consensus process, and any steps taken to reduce framing bias (such as using multiple independent readings before joint discussion). This addition will directly support the interpretation without altering the core findings. revision: yes
- Referee: [Section describing the PRISM dataset and sample] Concrete demographic and exclusion criteria are absent from the abstract, and the manuscript provides no information on sample demographics, response exclusion rules, or representativeness of the 1,500 PRISM responses. These omissions undermine the generalizability claim that current methods 'fail to identify actual preferences' across populations.
Authors: We acknowledge that the manuscript does not include a self-contained summary of the PRISM sample characteristics, exclusion rules, or representativeness, which limits assessment of generalizability. Although the PRISM dataset paper provides these details, the current work does not extract or discuss them. We will revise the manuscript to add a subsection on the dataset and sample, including available demographic breakdowns, any exclusion criteria applied to arrive at the 1,500 responses, and an explicit discussion of representativeness limitations. This will better ground the claims about preferences across populations. revision: yes
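The inter-rater reliability statistic raised in the first major comment can be made concrete with a short sketch. It assumes two coders who each assign one category per response. The category labels and the eight example codings are hypothetical and do not come from the paper, and `cohens_kappa` is an illustrative helper rather than the authors' procedure.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codings of eight "truthfulness" responses.
coder_1 = ["sourced", "sourced", "expert", "expert",
           "unpopular", "unpopular", "sourced", "expert"]
coder_2 = ["sourced", "expert", "expert", "expert",
           "unpopular", "sourced", "sourced", "expert"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Reporting a value like this alongside the coding protocol would let readers judge how far the category boundaries depend on who did the coding.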
Circularity Check
No circularity: empirical survey with no derivation or fitting chain
The paper is a qualitative and quantitative analysis of 1,500 open-ended survey responses from the PRISM dataset. It reports frequencies (e.g., truthfulness requested by 49%), codes open responses for divergent meanings, and draws interpretive conclusions about alignment methods. No equations, parameters, predictions, or derivations are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work is self-contained against external benchmarks (survey data) and receives the default non-finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 1,500 open-ended responses in the PRISM dataset provide representative and unbiased signals of what people actually want from AI.
Source paper: What Do People Actually Want From AI? Mapping Preference Plurality. Julia Sepúlveda Coelho and Scott A. Hale. FAccT ’26, June 25–28, 2026, Montreal, QC, Canada.