Free-form Association Tasks Reveal Stereotype Hallucination in Large Language Models

Amir Goldberg; Douglas Guilbeault; Xinrui Chloe Zhao

arxiv: 2606.30945 · v1 · pith:5CKP46LKnew · submitted 2026-06-29 · 💻 cs.CY

Pith reviewed 2026-07-01 00:55 UTC · model grok-4.3

classification 💻 cs.CY

keywords stereotype hallucination · large language models · first-order responses · second-order predictions · abstract stimuli · Rorschach blots · social groups · fine-tuning

The pith

LLMs generate stark second-order stereotypes that neither amplify their own first-order responses nor match actual human group differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how humans and large language models interpret abstract art and Rorschach blots, stimuli chosen because they lack fixed cultural meanings. Humans produce varied first-order interpretations across individuals and only moderately amplify group patterns when asked to predict how others will respond. LLMs instead give uniform first-order answers yet produce strong second-order stereotypes for groups defined by gender, partisanship, personality, and other domains; these stereotypes do not build on the models' own responses and do not track measured human differences. The same pattern appears even after the models are fine-tuned directly on participant data. The results point to limits on using LLMs to predict or simulate human judgments in open-ended situations.

Core claim

In free-form association tasks with abstract art and Rorschach blots, humans display heterogeneous first-order responses with minimal group structure and engage in moderate stereotype exaggeration when making second-order predictions about social groups. LLMs produce homogeneous first-order responses yet generate stark second-order stereotypes that neither amplify their first-order tendencies nor reflect actual human group differences, a process the authors term stereotype hallucination. This hallucination persists when the models are fine-tuned on response data collected from actual participants, indicating that LLMs do not emulate the cognitive processes underlying human stereotypes in nov

What carries the argument

The contrast between first-order personal interpretations and second-order predictions of group responses to abstract stimuli that lack pre-established cultural meanings, which isolates stereotype hallucination as the generation of ungrounded group stereotypes by LLMs.

If this is right

LLMs cannot be treated as reliable simulators of the cognitive processes that produce human stereotypes.
Fine-tuning on human response data does not remove the generation of ungrounded second-order stereotypes.
LLMs have significant limitations when used to model or predict human behavior in contexts that involve diverse or novel interpretations.
Stereotype hallucination constitutes a distinct mechanism from the stereotype exaggeration observed in humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same first-order versus second-order mismatch could appear in other open-ended tasks such as story completion or image captioning when group predictions are requested.
Applications that rely on LLMs to anticipate public reactions to ambiguous content may systematically overstate social divisions.
Designers could probe for hallucination by comparing a model's direct associations with its forecasts about demographic subgroups on new abstract inputs.
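
The probe suggested in the last item can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: it assumes that each response has already been reduced to a list of association words, and it uses Jensen-Shannon divergence between group word distributions as a stand-in separation measure. The function names (`word_distribution`, `js_divergence`, `separation_gap`) and the toy words in the usage note are hypothetical.

```python
import math
from collections import Counter


def word_distribution(words):
    """Turn a list of association words into a relative-frequency dict."""
    counts = Counter(w.strip().lower() for w in words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution over the joint vocabulary.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def separation_gap(first_a, first_b, second_a, second_b):
    """Second-order minus first-order between-group divergence.

    first_a / first_b: words produced when the model answers "as" group A / B
    (or words from actual members of each group).
    second_a / second_b: words the model predicts group A / B would produce.
    All four lists must be non-empty.
    """
    first = js_divergence(word_distribution(first_a), word_distribution(first_b))
    second = js_divergence(word_distribution(second_a), word_distribution(second_b))
    return second - first


# Toy, made-up data: identical first-order words, disjoint second-order words.
gap = separation_gap(
    ["bat", "mask", "cloud"], ["cloud", "bat", "mask"],
    ["storm", "ocean"], ["garden", "sun"],
)
```

A gap near zero would mean that predicted group differences are about as large as the differences in direct responses. A large positive gap, as in the toy data, is the pattern the page describes as ungrounded second-order separation. Any threshold for "large" would have to be set against human baselines, which this sketch does not include.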

Load-bearing premise

Abstract art and Rorschach blots lack pre-established cultural meanings, so differences between human and LLM responses can be attributed to hallucination rather than retrieval of learned associations.

What would settle it

Finding that LLMs' second-order stereotypes in these tasks closely match either their own first-order responses or the actual measured differences across human social groups would falsify the claim of stereotype hallucination.

Figures

Figures reproduced from arXiv: 2606.30945 by Amir Goldberg, Douglas Guilbeault, Xinrui Chloe Zhao.

**Figure 1.** Experimental design comparing human and LLM associative processes across first-order and second-order tasks. The survey pipeline (top) shows human participants randomly assigned to either first-order tasks (n=300, providing direct interpretations of 4 random stimuli) or second-order tasks (n=300 per domain, predicting group interpretations for 2 random stimuli for one of the five social domains). The LLM p…

**Figure 2.** Classification methodology for measuring between-group separability of associative patterns. Word associations (3 objects, 3 concepts, 1 emotion) from each social domain by each respondent are fed into a standardized prompt asking LLM classifiers to determine group membership based solely on linguistic patterns. Both GPT-4o mini and Llama-3.2-11B-Vision-Instruct serve as independent classifiers to validate…

**Figure 3.** Entropy of word association distributions with bootstrap error estimation by order, type of association, domain, and type of respondent. This figure displays the entropy values (displayed on the x-axis) of associations distributions for responses of human participants (blue circles, labeled as Human), GPT-4o mini (orange squares, labeled as GPT) and Llama-3.2-11B-Vision-Instruct (green diamonds, labeled as…

**Figure 4.** Semantic space analysis reveals LLM stereotype amplification. (A) …

**Figure 5.** Similarity analysis across domains for combined associations with fuzzy matching.

Original abstract

Recent studies argue that LLMs can predict human stereotypical judgments. Yet whether LLMs emulate the cognitive processes underlying human stereotypes, or merely retrieve learned associations to solve prediction tasks, remains unclear. Prior work examines LLMs' stereotypes in either (i) controlled judgment tasks like multiple choice surveys, or (ii) contexts constrained by conventionalized and predictable group biases. Here, we compare the structure of the stereotypes that humans and LLMs exhibit in the interpretation of free-form stimuli, namely abstract art and Rorschach blots, which lack pre-established cultural meanings. We recruit participants across five social domains (gender, partisanship, personality, urbanicity, and lifestyle) and elicit both first-order (direct personal interpretations) and second-order responses (predictions about how members of social groups will interpret the stimuli); we replicate this design with two multimodal models (GPT-4o mini and Llama-3.2-11B-Vision-Instruct). Humans and LLMs differ not only in magnitude but in the qualitative nature of their stereotypes. Human first-order responses display heterogeneity with minimal group structure. When predicting group responses, humans engage in "stereotype exaggeration" by moderately amplifying first-order tendencies while preserving diversity. By contrast, LLMs exhibit homogeneous first-order responses, and yet generate stark second-order stereotypes that neither amplify existing first-order tendencies nor reflect actual human group differences, a process we term "stereotype hallucination." LLMs continued to hallucinate stereotypes even when fine-tuned on the response data of actual participants. These findings suggest significant limitations in the use of LLMs to model and predict human behavior in novel contexts involving diverse interpretations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLMs producing uniform first-order responses to abstract stimuli but then strong second-order group stereotypes that do not track their own data or human patterns, yet the hallucination label rests on an unverified claim that the stimuli carry no training-data associations.

The main takeaway is that humans give varied first-order interpretations of abstract art and Rorschach blots and then moderately amplify group tendencies when predicting others, while the two tested multimodal models give consistent first-order answers but then generate second-order stereotypes that neither match their first-order outputs nor actual human differences. The authors label this mismatch stereotype hallucination and report that fine-tuning on the human response data does not eliminate it.

The design choice to use free-form tasks on stimuli presented as culturally neutral is a clear step beyond the multiple-choice setups in earlier work. Comparing first-order and second-order structure across five domains also lets them separate exaggeration from outright invention. The fine-tuning check is a practical addition that rules out a simple data-mismatch explanation.

The central weakness is the premise that these specific images have no learnable associations in the models' pretraining. Rorschach blots and abstract art appear in psychology papers, datasets, and web content; without an explicit check for overlap or a zero-shot baseline on the exact stimuli, the structured second-order outputs could reflect statistical regularities rather than a distinct generative process. The abstract also gives no sample sizes, prompt wording, or statistical tests, so the size and reliability of the reported differences cannot be judged from the available text.

This is relevant for anyone using LLMs to simulate human social cognition or behavior in open-ended settings. A reader working on behavioral modeling or AI safety would find the distinction worth testing. The paper deserves peer review because the question is timely and the framing is coherent, even though the stimuli-association issue and missing quantitative details will need direct attention in revision.

Referee Report

3 major / 2 minor

Summary. The paper compares human and LLM (GPT-4o mini, Llama-3.2-11B-Vision-Instruct) responses on free-form association tasks using abstract art and Rorschach blots across five domains (gender, partisanship, personality, urbanicity, lifestyle). Humans exhibit heterogeneous first-order interpretations and moderate 'stereotype exaggeration' in second-order group predictions; LLMs show homogeneous first-order responses yet generate stark second-order stereotypes unrelated to first-order patterns or actual human differences, termed 'stereotype hallucination,' which persists after fine-tuning on human data. The design uses stimuli argued to lack pre-established cultural meanings to isolate generative processes from retrieval.

Significance. If robust, the work offers a useful empirical distinction between human stereotype exaggeration and LLM stereotype hallucination in novel contexts, highlighting limits on LLMs as models of human social cognition. The free-form, non-conventionalized stimuli and direct first-/second-order comparison are strengths; the persistence after fine-tuning is a notable negative result.

major comments (3)

[Introduction/Methods] Introduction and Methods: The central distinction between 'stereotype hallucination' and retrieval of learned associations rests on the claim that abstract art and Rorschach blots 'lack pre-established cultural meanings.' No check for the presence of these exact images (or close visual analogs) in web-scale pretraining corpora, psychology literature, or art datasets is reported; without such verification the attribution of structured second-order outputs to hallucination rather than statistical regularities remains unestablished.
[Methods] Methods: The abstract describes a comparative design with humans and two models across five domains but supplies no sample sizes, number of stimuli per participant, exact prompt wording, statistical tests for distribution comparisons, or controls for order/presentation effects. These details are load-bearing for evaluating whether the reported qualitative differences (homogeneous vs. heterogeneous first-order responses) are reliable.
[Results] Results (fine-tuning experiment): The claim that LLMs 'continued to hallucinate stereotypes even when fine-tuned on the response data of actual participants' is central, yet no details are given on fine-tuning dataset size, procedure, hyperparameters, or how second-order outputs were evaluated post-fine-tuning. This prevents assessment of whether the hallucination effect is robust or an artifact of insufficient adaptation.

minor comments (2)

[Methods] Clarify operational definitions of 'first-order' and 'second-order' responses and how homogeneity vs. heterogeneity was quantified (e.g., entropy, variance metrics) in the Methods section.
Figure captions and tables should report exact participant and stimulus counts, confidence intervals, and p-values for all human-LLM comparisons.
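
One way to make the homogeneity question in the first minor comment concrete is Shannon entropy over the pooled association words, with a bootstrap standard error. The sketch below is illustrative only. The Figure 3 caption mentions entropy with bootstrap error estimation, but the paper's exact estimator, resampling unit, and number of resamples are not given on this page, so `n_boot`, `seed`, and both function names are assumptions.

```python
import math
import random
from collections import Counter


def shannon_entropy(words):
    """Shannon entropy (bits) of the empirical word distribution."""
    counts = Counter(words)
    n = len(words)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def bootstrap_entropy(words, n_boot=1000, seed=0):
    """Return (point estimate, bootstrap standard error) for entropy.

    Resamples individual words with replacement; `words` must be non-empty
    and `n_boot` must be at least 2. Resampling whole respondents instead
    may be more appropriate when each respondent contributes several words.
    """
    rng = random.Random(seed)
    estimates = [
        shannon_entropy(rng.choices(words, k=len(words)))
        for _ in range(n_boot)
    ]
    mean = sum(estimates) / n_boot
    variance = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return shannon_entropy(words), math.sqrt(variance)


# Toy, made-up data: one fully homogeneous and one maximally varied response set.
homogeneous = ["butterfly"] * 8
varied = ["butterfly", "mask", "bat", "cloud"] * 2
h_point, h_se = bootstrap_entropy(homogeneous, n_boot=200)
v_point, v_se = bootstrap_entropy(varied, n_boot=200)
```

In this toy case the homogeneous set has zero entropy and the varied set has two bits (four equally frequent words). Plug-in entropy is biased downward in small samples, so comparing humans and models fairly would also require matched sample sizes or a bias-corrected estimator.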

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where additional clarity will strengthen the manuscript. We address each major comment below, indicating revisions where the manuscript will be updated to incorporate the feedback.

Referee: [Introduction/Methods] Introduction and Methods: The central distinction between 'stereotype hallucination' and retrieval of learned associations rests on the claim that abstract art and Rorschach blots 'lack pre-established cultural meanings.' No check for the presence of these exact images (or close visual analogs) in web-scale pretraining corpora, psychology literature, or art datasets is reported; without such verification the attribution of structured second-order outputs to hallucination rather than statistical regularities remains unestablished.

Authors: We acknowledge that an exhaustive verification of these specific images across all pretraining corpora was not performed, as such a search is computationally prohibitive at web scale. The stimuli were selected from established psychological instruments (Rorschach blots) and abstract art explicitly chosen for their documented ambiguity and absence of fixed cultural referents in the relevant literature. We will revise the Methods section to expand the stimulus selection rationale, cite supporting references on the interpretive openness of these materials, and explicitly note the limitation regarding direct corpus checks while arguing that the free-form task design isolates generative processes from simple retrieval.

revision: partial

Referee: [Methods] Methods: The abstract describes a comparative design with humans and two models across five domains but supplies no sample sizes, number of stimuli per participant, exact prompt wording, statistical tests for distribution comparisons, or controls for order/presentation effects. These details are load-bearing for evaluating whether the reported qualitative differences (homogeneous vs. heterogeneous first-order responses) are reliable.

Authors: The full manuscript contains these details, but we agree they should be more prominently and systematically presented. We will revise the Methods section to include explicit reporting of human sample sizes, stimuli counts per domain and participant, verbatim prompt templates, the specific statistical tests employed for comparing response distributions, and procedures used to counterbalance order and presentation effects. A new subsection on experimental controls will be added.

revision: yes

Referee: [Results] Results (fine-tuning experiment): The claim that LLMs 'continued to hallucinate stereotypes even when fine-tuned on the response data of actual participants' is central, yet no details are given on fine-tuning dataset size, procedure, hyperparameters, or how second-order outputs were evaluated post-fine-tuning. This prevents assessment of whether the hallucination effect is robust or an artifact of insufficient adaptation.

Authors: We agree that these implementation details are necessary for evaluating the fine-tuning result. We will expand the relevant Results subsection to report the exact size of the fine-tuning dataset, the full procedure (including data splitting and formatting), all hyperparameters used, and the evaluation protocol for post-fine-tuning second-order responses. This will clarify that the hallucination persisted under the reported adaptation conditions.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of response distributions

The paper conducts an empirical study eliciting first- and second-order responses from humans and LLMs to abstract art and Rorschach blots, then compares the resulting distributions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim of 'stereotype hallucination' rests on observed differences in response structure rather than any reduction to prior inputs by construction. The assumption that stimuli lack pre-established meanings is stated as a design rationale but is not part of a derivation chain that collapses the result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that the chosen stimuli carry no pre-existing cultural associations and on standard experimental psychology assumptions about response elicitation; no free parameters or independently evidenced invented entities are introduced.

axioms (1)

domain assumption: Abstract art and Rorschach blots lack pre-established cultural meanings.
This premise is invoked to justify that observed differences reflect internal processes rather than retrieval of known associations.

invented entities (1)

stereotype hallucination (no independent evidence)
purpose: Label for the process in which LLMs produce second-order stereotypes unrelated to first-order responses or human data.
New descriptive term introduced for the observed mismatch; no independent falsifiable prediction or external evidence is supplied in the abstract.


[1] Gordon Allport. The Nature of Prejudice. Addison-Wesley Pub. Co., 1954.

[2] Shelley E Taylor. A categorization approach to stereotyping. In Cognitive processes in stereotyping and intergroup behavior, pages 83–114. Psychology Press, 2015.

[3] Xuechunzi Bai, Miguel R Ramos, and Susan T Fiske. As diversity increases, people paradoxically perceive social groups as more similar. Proceedings of the National Academy of Sciences, 117(23):12741–12749, 2020.

[4] Douglas Guilbeault, Solène Delecourt, Tasker Hull, Bhargav Srinivasa Desikan, Mark Chu, and Ethan Nadler. Online images amplify gender bias. Nature, 626(8001):1049–1055, 2024.

[5] Trenton D Mize. Divergence and convergence across presumed and actual stereotypes. Socius, 10:23780231241286873, 2024.

[6] Douglas Guilbeault, Austin Van Loon, Katharina Lix, Amir Goldberg, and Sameer B Srivastava. Exposure to the views of opposing others with latent cognitive differences results in social influence—but only when those differences remain obscured. Management Science, 70(10):6669–6684, 2024.

[7] Austin Van Loon, Amir Goldberg, and Sameer B Srivastava. Imagined otherness fuels blatant dehumanization of outgroups. Communications psychology, 2(1):39, 2024.

[8] Nicholas C Dias, Yphtach Lelkes, and Jacob Pearl. American partisans vastly under-estimate the diversity of other partisans’ policy attitudes. Political Science Research and Methods, 13(3):725–735, 2025.

[9] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

[10] Austin C Kozlowski, Matt Taddy, and James A Evans. The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review, 84(5):905–949, 2019.

[11] Tessa ES Charlesworth, Aylin Caliskan, and Mahzarin R Banaji. Historical representations of social groups across 200 years of word embeddings from google books. Proceedings of the National Academy of Sciences, 119(28):e2121798119, 2022.

[12] Tessa ES Charlesworth, Kshitish Ghate, Aylin Caliskan, and Mahzarin R Banaji. Extracting intersectional stereotypes from embeddings: Developing and validating the flexible intersectional stereotype extraction procedure. PNAS nexus, 3(3):pgae089, 2024.

[13] Douglas Guilbeault, Solène Delecourt, and Bhargav Srinivasa Desikan. Age and gender distortion in online media and large language models. Nature, 646(8087):1129–1137, 2025.

[14] Aliah Zewail, Alexandra Figueroa, Jesse Graham, and Mohammad Atari. Moral stereotyping in large language models. Proceedings of the National Academy of Sciences, 123(10):e2519941123, 2026.

[15] Alina Leidinger and Richard Rogers. How are llms mitigating stereotyping harms? learning from search engine studies. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 839–854, 2024.

[16] Esther Boissin, Thomas H Costello, Daniel Spinoza-Martín, David G Rand, and Gordon Pennycook. Dialogues with large language models reduce conspiracy beliefs even when the ai is perceived as human. PNAS nexus, 4(11):pgaf325, 2025.

[17] Thomas H Costello, Gordon Pennycook, and David G Rand. Durably reducing conspiracy beliefs through dialogues with ai. Science, 385(6714):eadq1814, 2024.

[18] Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th annual ACM symposium on user interface software and technology, pages 1–18, 2022.

[19] Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023.

[20] Yiting Chen, Tracy Xiao Liu, You Shan, and Songfa Zhong. The emergence of economic rationality of gpt. Proceedings of the National Academy of Sciences, 120(51):e2316205120, 2023.

[21] Joon Sung Park, Carolyn Q Zou, Aaron Shaw, Benjamin Mako Hill, Carrie Cai, Meredith Ringel Morris, Robb Willer, Percy Liang, and Michael S Bernstein. Generative agent simulations of 1,000 people. arXiv preprint arXiv:2411.10109, 2024.

[22] Luke Hewitt, Ashwini Ashokkumar, Isaias Ghezae, and Robb Willer. Predicting results of social science experiments using large language models. Preprint, 2024.

[23] Austin C Kozlowski and James Evans. Simulating subjects: The promise and peril of artificial intelligence stand-ins for social agents and interactions. Sociological Methods & Research, 54(3):1017–1073, 2025.

[24] Austin C Kozlowski, Hyunku Kwon, and James A Evans. In silico sociology: forecasting covid-19 polarization with large language models. arXiv preprint arXiv:2407.11190, 2024.

[25] Apostolos Filippas, John J Horton, and Benjamin S Manning. Large language models as simulated economic agents: What can we learn from homo silicus? In Proceedings of the 25th ACM Conference on Economics and Computation, pages 614–615, 2024.

[26] Steven T Piantadosi, Dyana CY Muller, Joshua S Rule, Karthikeya Kaushik, Mark Gorenstein, Elena R Leib, and Emily Sanford. Why concepts are (probably) vectors. Trends in Cognitive Sciences, 28(9):844–856, 2024.

[27] Anil Ananthaswamy. How close is ai to human-level intelligence. Nature, 636(8041):22–25, 2024.

[28] Massimo Stella, Thomas T Hills, and Yoed N Kenett. Using cognitive psychology to understand gpt-like models needs to extend beyond human biases. Proceedings of the National Academy of Sciences, 120(43):e2312911120, 2023.

[29] Ethan O Nadler, Elise Darragh-Ford, Bhargav Srinivasa Desikan, Christian Conaway, Mark Chu, Tasker Hull, and Douglas Guilbeault. Divergences in color perception between deep neural networks and humans. Cognition, 241:105621, 2023.

[30] Jeffrey S Bowers, Gaurav Malhotra, Marin Dujmović, Milton Llera Montero, Christian Tsvetkov, Valerio Biscione, Guillermo Puebla, Federico Adolfi, John E Hummel, Rachel F Heaton, et al. Deep problems with neural network models of human vision. Behavioral and Brain Sciences, 46:e385, 2023.

[31] Ethan O Nadler, Douglas Guilbeault, Sofronia M Ringold, TR Williamson, Antoine Bellemare-Pepin, Iulia M Comșa, Karim Jerbi, Srini Narayanan, and Lisa Aziz-Zadeh. Statistical or embodied? comparing colorseeing, colorblind, painters, and large language models in their processing of color metaphors. Cognitive Science, 49(7):e70083, 2025.

[32] Alessandro B Palmarini and Melanie Mitchell. Abstract understanding of core-knowledge concepts: Humans vs. llms. In ICML 2024 Workshop on LLMs and Cognition, 2024.

[33] Claas Beger, Ryan Yi, Shuhao Fu, Arseny Moskvichev, Sarah W Tsai, Sivasankaran Rajamanickam, and Melanie Mitchell. Do ai models perform human-like abstract reasoning across modalities? arXiv preprint arXiv:2510.02125, 2025.

[34] Melanie Mitchell and David C Krakauer. The debate over understanding in ai’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120, 2023.
