pith. machine review for the scientific record.

arxiv: 2306.16388 · v2 · submitted 2023-06-28 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords language models · global opinions · opinion bias · survey responses · cross-national data · response similarity

The pith

Large language models produce answers that match opinions from the United States and certain European and South American countries more closely than opinions from other nations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of questions and answers drawn from cross-national surveys to capture how people in different countries actually respond to the same issues. It then defines a metric that scores how similar a model's generated answers are to the human answers collected from each country. Experiments on one model trained with Constitutional AI to be helpful, honest, and harmless show that its default outputs line up best with responses from the United States and selected European and South American populations. When the model is explicitly told to answer from a particular country's viewpoint, the similarity shifts toward that population but can also reproduce cultural stereotypes. Translating the survey questions into another language does not automatically make the model's answers match the views of people who speak that language.

Core claim

Using the GlobalOpinionQA dataset and a country-conditioned similarity metric, the authors show that default LLM responses align more closely with the opinion distributions of the USA and some European and South American countries than with the distributions recorded in other surveyed nations.

What carries the argument

A similarity metric that measures how closely an LLM's survey-style answers match the response distribution collected from each country's human respondents.
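The page never states the metric's formula, but the simulated rebuttal below names Jensen-Shannon divergence, so a plausible instantiation is one minus the JS divergence between the model's answer distribution and a country's human answer distribution, averaged over questions. A minimal sketch under that assumption, with toy distributions (the function names and numbers are illustrative, not the paper's):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2, so it lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity(model_dists, country_dists):
    """Average 1 - JSD over questions; higher means the model's answers
    look more like this country's survey respondents."""
    scores = [1 - js_divergence(m, c) for m, c in zip(model_dists, country_dists)]
    return sum(scores) / len(scores)

# Toy example: two 3-option survey questions, model vs. one country's humans.
model = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
usa   = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
score = similarity(model, usa)
```

Ranking countries by this score per model is what lets the paper say "more similar to the USA than to other surveyed nations."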

If this is right

  • Default model outputs will over-represent the views of a subset of countries on contested global issues.
  • Explicit country prompts can increase similarity to the target population but may introduce stereotype-like content.
  • Machine translation of questions alone does not guarantee that answers will track the opinions of speakers of the target language.
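The three bullets correspond to the paper's three probe conditions: default prompting, cross-national prompting, and linguistic translation. A sketch of how such prompts could be assembled; the templates and the `translate` hook are hypothetical stand-ins, not the paper's exact wording:

```python
# Sketch of the three probe conditions; templates are illustrative.
def build_prompt(question, options, condition="default", country=None, translate=None):
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    if condition == "default":
        return f"{question}\n{opts}"
    if condition == "cross-national":
        return (f"How would someone from {country} respond to this question?\n"
                f"{question}\n{opts}")
    if condition == "translated":
        # translate() stands in for an MT system; hypothetical hook.
        return f"{translate(question)}\n{opts}"
    raise ValueError(f"unknown condition: {condition}")

prompt = build_prompt("Is more immigration good for your country?",
                      ["Good", "Bad", "Neither"],
                      condition="cross-national", country="Germany")
```

Comparing the similarity scores produced under each condition, per country, is what separates the three findings above.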

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Auditing pipelines built on this dataset could be applied to other models to track changes after training updates or fine-tuning.
  • The same framework could be extended to track how opinion alignment shifts when models are trained on more geographically balanced data.
  • Developers might need separate evaluation tracks for factual questions versus value-laden questions when testing global fairness.

Load-bearing premise

The chosen survey answers serve as a fair and unbiased record of what each country's population actually thinks.

What would settle it

A replication in which the same model produces answers whose similarity scores are statistically indistinguishable across all countries in the dataset.

Original abstract

Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GlobalOpinionQA, a dataset compiled from cross-national surveys capturing opinions on global issues across countries, along with a similarity metric to compare LLM-generated responses against human responses conditioned on country. Experiments on a Constitutional AI-trained LLM (helpful, honest, and harmless) show that default outputs align more closely with opinions from the USA and certain European and South American populations; country-specific prompting shifts alignments toward the prompted group but risks cultural stereotypes; and translating questions to a target language does not reliably increase similarity to speakers of that language. The dataset is released publicly with an interactive visualization.

Significance. If the metric and findings are robust, the work supplies a practical, reproducible framework for quantifying cultural and geographic biases in LLM opinion representation, with direct relevance to fairness, alignment, and global equity concerns in NLP. The public release of GlobalOpinionQA and the visualization tool strengthens the contribution by enabling independent verification and extension.

major comments (3)
  1. [Section 2] Dataset construction: The paper treats responses from the source cross-national surveys as representative ground truth for each country's population-level opinions without discussing or validating against known sampling issues (non-probability sampling, urban/educated oversampling, or small N in non-Western countries). Because the similarity metric and all three experiments rest on this assumption, the central claim that LLMs over-align with USA/European/South American opinions could instead track survey-participant demographics rather than true national opinions.
  2. [Section 3] Similarity metric definition: The exact formula for the similarity metric, the statistical tests used to compare distributions, and any controls for question difficulty or response length are not specified. This under-specification directly affects the reliability and interpretability of the quantitative results reported for the default, prompting, and translation experiments.
  3. [Section 5.2] Prompting experiment: The claim that country-specific prompting can produce harmful cultural stereotypes is stated but lacks concrete examples, frequency counts, or a systematic analysis of stereotype content, making it difficult to evaluate the practical severity of this side effect.
minor comments (2)
  1. [Abstract] The abstract refers to 'some European and South American countries' without naming them; listing the specific countries with highest alignment would improve precision.
  2. [Section 2] The paper provides dataset and visualization links but omits basic summary statistics (number of questions, countries covered, response distributions per question) that would help readers assess coverage and balance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below with point-by-point responses and indicate planned revisions to improve the manuscript's clarity and rigor.

Point-by-point responses
  1. Referee: [Section 2] Dataset construction: The paper treats responses from the source cross-national surveys as representative ground truth for each country's population-level opinions without discussing or validating against known sampling issues (non-probability sampling, urban/educated oversampling, or small N in non-Western countries). Because the similarity metric and all three experiments rest on this assumption, the central claim that LLMs over-align with USA/European/South American opinions could instead track survey-participant demographics rather than true national opinions.

    Authors: We acknowledge that the cross-national surveys (e.g., Pew, World Values Survey) underlying GlobalOpinionQA have well-documented sampling limitations, including non-probability methods, urban/educated oversampling, and smaller samples in some non-Western countries. Our framework treats these responses as the best available empirical proxy for reported national opinions rather than claiming they represent 'true' population opinions; the similarity metric is therefore relative to the observed survey distributions. We agree a more explicit discussion is warranted and will revise Section 2 to describe the surveys' sampling characteristics and add a dedicated limitations subsection that caveats all findings accordingly. revision: yes

  2. Referee: [Section 3] Similarity metric definition: The exact formula for the similarity metric, the statistical tests used to compare distributions, and any controls for question difficulty or response length are not specified. This under-specification directly affects the reliability and interpretability of the quantitative results reported for the default, prompting, and translation experiments.

    Authors: We apologize for the under-specification. The similarity metric averages a distributional divergence (Jensen-Shannon) between the LLM's response distribution and the human response distribution per question, conditioned on country. Statistical comparisons use permutation tests. No explicit controls for question difficulty or response length appear in the main results, though supplementary checks were performed. In revision we will state the precise formula in Section 3, detail the statistical procedures, and add an appendix with robustness analyses that control for response length and question characteristics. revision: yes

  3. Referee: [Section 5.2] Prompting experiment: The claim that country-specific prompting can produce harmful cultural stereotypes is stated but lacks concrete examples, frequency counts, or a systematic analysis of stereotype content, making it difficult to evaluate the practical severity of this side effect.

    Authors: We agree that the stereotype observation requires more concrete support. Our experiments identified instances in which country-specific prompts elicited stereotypical content (e.g., associating particular nationalities with fixed economic or social traits). We will expand Section 5.2 with representative examples, report the approximate frequency of such outputs across the tested prompts based on manual inspection, and include a short qualitative categorization of the stereotype types observed. revision: yes
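The rebuttal's second response says statistical comparisons use permutation tests but gives no procedure. A minimal sketch of a two-sided permutation test on the difference in mean per-question similarity between two countries; the score lists here are hypothetical, not the paper's data:

```python
import random

def permutation_test(scores_a, scores_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of mean
    per-question similarity scores between two countries."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    k, hits = len(scores_a), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Hypothetical per-question similarity scores for two countries.
country_a = [0.81, 0.78, 0.84, 0.80, 0.77, 0.83]
country_b = [0.55, 0.60, 0.52, 0.58, 0.61, 0.57]
p_value = permutation_test(country_a, country_b)
```

A small p-value here would license the claim that one country's similarity to the model genuinely exceeds another's, rather than being noise across questions.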

Circularity Check

0 steps flagged

No circularity; framework uses external survey data as benchmark

Full rationale

The paper constructs GlobalOpinionQA from independent cross-national surveys and defines a similarity metric directly against those human responses conditioned on country. No parameters are fitted to the target similarity values, no self-citation chain justifies the core measurement, and the derivation does not reduce to self-definition or renaming. The central claims rest on external benchmarks rather than internal fits or author-specific uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework treats existing cross-national survey responses as ground-truth distributions of opinion per country; this is a domain assumption rather than a derived quantity.

axioms (1)
  • domain assumption Responses collected in cross-national surveys accurately reflect the distribution of opinions within each country's population.
    The similarity metric conditions directly on these survey answers as the reference distribution for each country.

pith-pipeline@v0.9.0 · 5608 in / 1251 out tokens · 24078 ms · 2026-05-16T07:37:44.317773+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

    cs.CL 2026-05 conditional novelty 7.0

    LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.

  2. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 accept novelty 7.0

    StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.

  3. StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs

    cs.CY 2026-05 unverdicted novelty 7.0

    StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.

  4. Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

    cs.AI 2026-05 unverdicted novelty 7.0

    Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.

  5. XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

    cs.CL 2026-05 unverdicted novelty 7.0

    XL-SafetyBench is a new cross-cultural benchmark showing frontier LLMs decouple jailbreak robustness from cultural sensitivity while local models trade off attack success against neutral-safe rates in a near-linear pa...

  6. Large Language Models Exhibit Normative Conformity

    cs.AI 2026-04 unverdicted novelty 7.0

    Large language models exhibit normative conformity in addition to informational conformity, and subtle social context can direct which group they conform to.

  7. C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

    cs.CL 2026-04 unverdicted novelty 7.0

    C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to syn...

  8. Overtrained, Not Misaligned

    cs.LG 2026-05 unverdicted novelty 6.0

    Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

  9. Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

    cs.CL 2026-05 unverdicted novelty 6.0

    DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.

  10. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  11. Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

  12. The Collapse of Heterogeneity in Silicon Philosophers

    cs.CY 2026-04 unverdicted novelty 6.0

    Large language models collapse philosophical heterogeneity by over-correlating judgments across domains, creating artificial consensus unlike the views of 277 professional philosophers.

  13. Measuring Opinion Bias and Sycophancy via LLM-based Persuasion

    cs.CL 2026-04 unverdicted novelty 6.0

    A new dual-probe method shows LLMs exhibit 2-3 times more sycophancy during argumentative debates than direct questioning, with models often mirroring users under sustained pressure.

  14. Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

    cs.CL 2026-04 unverdicted novelty 6.0

    A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

  15. Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.

  16. Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs generate narratives containing persistent stereotypes, erasure, and one-dimensional portrayals of Global Majority national identities, with minoritized groups overrepresented in subordinated roles by more than fi...

  17. When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

    cs.CL 2026-04 unverdicted novelty 5.0

    Mid-sized LLMs outperform larger models on fairness in multi-document news summarization, with entity sentiment bias proving hardest to mitigate across prompt and judge-based interventions.

  18. Normative Common Ground Replication (NormCoRe): Replication-by-Translation for Studying Norms in Multi-Agent AI

    cs.AI 2026-03 conditional novelty 5.0

    NormCoRe is a replication-by-translation framework that maps human subject studies onto multi-agent AI environments, showing AI normative judgments on fairness differ from human baselines and vary with model choice an...

  19. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

  20. SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures

    cs.CL 2026-05 unverdicted novelty 4.0

    SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 18 Pith papers · 7 internal anchors

  1. [1]

    Persistent anti-muslim bias in large language models

    Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, page 298–306, New York, NY , USA, 2021. Association for Comput- ing Machinery. ISBN 9781450384735. doi: 10.1145/3461702.3462624. URL https: //doi.org/10.1145/3461702.3462624

  2. [2]

    Subjective natural language problems: Motivations, applications, characterizations, and implications

    Cecilia Ovesdotter Alm. Subjective natural language problems: Motivations, applications, characterizations, and implications. In Proceedings of the 49th Annual Meeting of the Associa- tion for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11, page 107–112, USA, 2011. Association for Computational Linguistics. ISBN 9...

  3. [3]

    Probing pre-trained language models for cross-cultural differences in values

    Arnav Arora, Lucie-aimée Kaffee, and Isabelle Augenstein. Probing pre-trained language models for cross-cultural differences in values. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 114–130, Dubrovnik, Croatia, May

  4. [4]

    URL https://aclanthology.org/2023

    Association for Computational Linguistics. URL https://aclanthology.org/2023. c3nlp-1.12

  5. [5]

    A general language assistant as a laboratory for alignment, 2021

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a labora...

  6. [6]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. J. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, D...

  7. [7]

    Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  8. [8]

    Solon Barocas and Andrew D. Selbst. Big data’s disparate impact. California Law Review, 104: 671, 2016

  9. [9]

    and Gebru, Timnit and McMillan-Major, Angelina and Shmitchell, Shmargaret , title =

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, 11 New York, NY , USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi:...

  10. [10]

    Berinsky

    Adam J. Berinsky. Measuring public opinion with surveys. Annual Review of Political Science, 20(1):309–329, 2017. doi: 10.1146/annurev-polisci-101513-113724. URL https://doi. org/10.1146/annurev-polisci-101513-113724

  11. [11]

    Language (technol- ogy) is power: A critical survey of “bias” in NLP

    Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technol- ogy) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 5454–5476, Online, July

  12. [12]

    doi: 10.18653/v1/2020.acl-main.485

    Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485

  13. [13]

    Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

  14. [14]

    Creel, Ananya Kumar, Dan Jurafsky, and Percy S Liang

    Rishi Bommasani, Kathleen A. Creel, Ananya Kumar, Dan Jurafsky, and Percy S Liang. Picking on the same person: Does algorithmic monoculture lead to outcome homogenization? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 3663–3678. Curran Associates, Inc., 20...

  15. [15]

    Shikha Bordia and Samuel R. Bowman. Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop , pages 7–15, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/ v1...

  16. [16]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar- wal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma- teusz Litwin, S...

  17. [17]

    Identity and interaction: a sociocultural linguistic approach

    Mary Bucholtz and Kira Hall. Identity and interaction: a sociocultural linguistic approach. Discourse Studies, 7(4-5):585–614, 2005. doi: 10.1177/1461445605054407. URL https: //doi.org/10.1177/1461445605054407. 12

  18. [18]

    The whiteness of ai

    Stephen Cave and Kanta Dihal. The whiteness of ai. Philosophy & Technology, 33:1–19, 12

  19. [19]

    doi: 10.1007/s13347-020-00415-6

  20. [20]

    Marked personas: Using natural language prompts to measure stereotypes in language models, 2023

    Myra Cheng, Esin Durmus, and Dan Jurafsky. Marked personas: Using natural language prompts to measure stereotypes in language models, 2023

  21. [21]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, edi- tors, Advances in Neural Information Processing Systems , volume 30. Curran Associates, Inc., 2017. URL https://proce...

  22. [22]

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022

  23. [23]

    ISBN 9781713829546

    Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. Dealing with dis- agreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics , 10:92–110, 2022. doi: 10.1162/tacl\_a\_00449. URL https://aclanthology.org/2022.tacl-1.6

  24. [24]

    Western civilization, our tradition, Nov 2020

    Fad-Admin. Western civilization, our tradition, Nov 2020. URL https://isi.org/ intercollegiate-review/western-civilization-our-tradition/

  25. [25]

    Measuring diversity of artificial intelligence conferences

    Ana Freire, Lorenzo Porcaro, and Emilia Gómez. Measuring diversity of artificial intelligence conferences. In Deepti Lamba and William H. Hsu, editors, Proceedings of 2nd Workshop on Diversity in Artificial Intelligence (AIDBEI), volume 142 of Proceedings of Machine Learning Research, pages 39–50. PMLR, 09 Feb 2021. URL https://proceedings.mlr.press/ v142...

  26. [26]

    Artificial intelligence, values, and alignment

    Iason Gabriel. Artificial intelligence, values, and alignment. Minds and Machines , 30(3): 411–437, sep 2020. doi: 10.1007/s11023-020-09539-2. URL https://doi.org/10.1007% 2Fs11023-020-09539-2

  27. [27]

    The challenge of value alignment: from fairer algorithms to AI safety

    Iason Gabriel and Vafa Ghazavi. The challenge of value alignment: from fairer algorithms to AI safety. CoRR, abs/2101.06060, 2021. URL https://arxiv.org/abs/2101.06060

  28. [28]

    Predictability and surprise in large generative models

    Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernian, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom B...

  29. [29]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022

  30. [30]

    Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamil˙e Lukoši¯ut˙e, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noem...

  31. [31]

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Real- ToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online, November

  32. [32]

    doi: 10.18653/v1/2020.findings-emnlp.301

    Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.301. URL https://aclanthology.org/2020.findings-emnlp.301. 13

  33. [33]

    Improving alignment of dialogue agents via targeted human judgements, 2022

    Amelia Glaese, Nat McAleese, Maja Tr˛ ebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Mari- beth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soˇna Mokrá,...

  34. [34]

    Gordon, Kaitlyn Zhou, Kayur Patel, Tatsunori Hashimoto, and Michael S

    Mitchell L. Gordon, Kaitlyn Zhou, Kayur Patel, Tatsunori Hashimoto, and Michael S. Bernstein. The disagreement deconvolution: Bringing machine learning performance metrics in line with re- ality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21, New York, NY , USA, 2021. Association for Computing Machinery. ISBN 978...

  35. [35]

    Gordon, Michelle S

    Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. Jury learning: Integrating dissenting voices into machine learning models. In CHI Conference on Human Factors in Computing Systems. ACM, apr 2022. doi: 10.1145/3491102.3502004. URL https://doi.org/10.1145%2F3491102.3502004

  36. [36]

    Kivlichan, Rachel Rosen, and Lucy Vasserman

    Nitesh Goyal, Ian D. Kivlichan, Rachel Rosen, and Lucy Vasserman. Is your toxicity my toxicity? exploring the impact of rater identity on toxicity annotation. Proceedings of the ACM on Human-Computer Interaction, 6:1–28, 2022

37. [37] Christian Haerpfer, Ronald Inglehart, Alejandro Moreno, Christian Welzel, Kseniya Kizilova, Jaime Diez-Medrano, Milena Lagos, Pippa Norris, Eduard Ponarin, and Bianca Puranen. World values survey: Round seven – country-pooled datafile version 5.0.0, 2022

38. [38] Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. The political ideology of conversational AI: Converging evidence on ChatGPT's pro-environmental, left-libertarian orientation, 2023

39. [39] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3309–3326, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.234. URL https://aclanthology.org/2022.acl-long.234

41. [41] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=dNy_RKzJacY

42. [42] Joseph Henrich, Steven J. Heine, and Ara Norenzayan. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3):61–83, June 2010. ISSN 1469-1825. URL http://journals.cambridge.org/abstract_S0140525X0999152X

43. [43] Dirk Hovy and Diyi Yang. The importance of modeling social factors of language: Theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 588–602, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-ma...

44. [44] Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v...

45. [45] Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman. Co-writing with opinionated language models affects users' views. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394215. doi: 10.1145/3544548.3581196. URL https:...

46. [46] Hang Jiang, Doug Beeferman, Brandon Roy, and Deb Roy. CommunityLM: Probing partisan worldviews from language models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6818–6826, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.593

47. [47] Rebecca L Johnson, Giada Pistilli, Natalia Menédez-González, Leslye Denisse Dias Duran, Enrico Panai, Julija Kalpokiene, and Donald Jay Bertulfo. The ghost in the machine has an American accent: Value conflict in GPT-3, 2022

48. [48] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-mai...

49. [49] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, ... Language models (mostly) know what they know, 2022

50. [50] Pratyusha Kalluri. Don't ask if artificial intelligence is good or fair, ask how it shifts power. Nature, 583:169, 2020

51. [51] Saketh Reddy Karra, Son The Nguyen, and Theja Tulabandhula. Estimating the personality of white-box language models. arXiv e-prints, arXiv:2204.12000, April 2022. doi: 10.48550/arXiv.2204.12000

52. [52] Atoosa Kasirzadeh and Iason Gabriel. In conversation with artificial intelligence: Aligning language models with human values. Philosophy & Technology, 36(2):1–24, 2023

53. [53] Faisal Ladhak, Esin Durmus, Mirac Suzgun, Tianyi Zhang, Dan Jurafsky, Kathleen McKeown, and Tatsunori Hashimoto. When do pre-training biases propagate to downstream tasks? A case study in text summarization. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3206–3219, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.234

55. [55] Alon Lavie. Evaluating the output of machine translation systems. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Tutorials, 2010

56. [56] Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR, 2021

57. [57] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu...

58. [58] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.acl-long.556

59. [59] Li Lucy and David Bamman. Gender and representation bias in GPT-3 generated stories. In Proceedings of the Third Workshop on Narrative Understanding, pages 48–55, Virtual, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.nuse-1.5. URL https://aclanthology.org/2021.nuse-1.5

61. [61] S. McConnell-Ginet. Words Matter: Meaning and Power. Cambridge University Press, 2020. ISBN 9781108427210. URL https://books.google.com/books?id=gKVTzQEACAAJ

62. [62] Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416

64. [64] Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, Ting-Hao Huang, and Shomir Wilson. Nationality bias in text generation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 116–122, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URL https://aclanthol...

65. [65] Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, UIST '22, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9...

66. [66] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11):100336, 2021. ISSN 2666-3899. doi: 10.1016/j.patter.2021.100336. URL https://www.sciencedirect.com/science/article/pii/S2666389921001847

67. [67] Ethan Perez, Saffron Huang, H. Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. CoRR, abs/2202.03286, 2022. URL https://arxiv.org/abs/2202.03286

68. [68] Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kern...

69. [69] Vinodkumar Prabhakaran, Margaret Mitchell, Timnit Gebru, and Iason Gabriel. A human rights-based approach to responsible AI, 2022

70. [70] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, ...

71. [71] Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, and Amandalynne Paullada. AI and the everything in the whole wide world benchmark. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1. Curran, 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/pa...

72. [72] Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel, William Isaac, and Lisa Anne Hendricks. Characteristics of harmful text: Towards rigorous benchmarking of language models, 2022

73. [73] Sebastian Ruder. Why You Should Do NLP Beyond English. http://ruder.io/nlp-beyond-english, 2020

74. [74] Nithya Sambasivan, Erin Arnesen, Ben Hutchinson, Tulsee Doshi, and Vinodkumar Prabhakaran. Re-imagining algorithmic fairness in India and beyond. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 315–328, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3...

75. [75] Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect?, 2023

76. [76] Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. Social bias frames: Reasoning about social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5477–5490, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/202... URL https://aclanthology.org/2020.acl-main.486

78. [78] Maarten Sap, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, and Noah A. Smith. Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5884–5906, Seattle...

79. [79] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 59–68, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361255. doi: 10.1145/3287560.328...

80. [80] Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China, November...
