arxiv: 2606.30945 · v1 · pith:5CKP46LKnew · submitted 2026-06-29 · 💻 cs.CY

Free-form Association Tasks Reveal Stereotype Hallucination in Large Language Models

Pith reviewed 2026-07-01 00:55 UTC · model grok-4.3

classification 💻 cs.CY
keywords stereotype hallucination · large language models · first-order responses · second-order predictions · abstract stimuli · Rorschach blots · social groups · fine-tuning

The pith

LLMs generate stark second-order stereotypes that neither amplify their own first-order responses nor match actual human group differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares how humans and large language models interpret abstract art and Rorschach blots, stimuli chosen because they lack fixed cultural meanings. Humans produce varied first-order interpretations across individuals and only moderately amplify group patterns when asked to predict how others will respond. LLMs instead give uniform first-order answers yet produce strong second-order stereotypes for groups defined by gender, partisanship, personality, and other domains; these stereotypes do not build on the models' own responses and do not track measured human differences. The same pattern appears even after the models are fine-tuned directly on participant data. The results point to limits on using LLMs to predict or simulate human judgments in open-ended situations.

Core claim

In free-form association tasks with abstract art and Rorschach blots, humans display heterogeneous first-order responses with minimal group structure and engage in moderate stereotype exaggeration when making second-order predictions about social groups. LLMs produce homogeneous first-order responses yet generate stark second-order stereotypes that neither amplify their first-order tendencies nor reflect actual human group differences, a process the authors term stereotype hallucination. This hallucination persists when the models are fine-tuned on response data collected from actual participants, indicating that LLMs do not emulate the cognitive processes underlying human stereotypes in novel contexts.

What carries the argument

The contrast between first-order personal interpretations and second-order predictions of group responses to abstract stimuli that lack pre-established cultural meanings, which isolates stereotype hallucination as the generation of ungrounded group stereotypes by LLMs.

If this is right

  • LLMs cannot be treated as reliable simulators of the cognitive processes that produce human stereotypes.
  • Fine-tuning on human response data does not remove the generation of ungrounded second-order stereotypes.
  • LLMs have significant limitations when used to model or predict human behavior in contexts that involve diverse or novel interpretations.
  • Stereotype hallucination constitutes a distinct mechanism from the stereotype exaggeration observed in humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same first-order versus second-order mismatch could appear in other open-ended tasks such as story completion or image captioning when group predictions are requested.
  • Applications that rely on LLMs to anticipate public reactions to ambiguous content may systematically overstate social divisions.
  • Designers could probe for hallucination by comparing a model's direct associations with its forecasts about demographic subgroups on new abstract inputs.
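
The probe in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes each response has already been reduced to a single association word, and it uses Jensen-Shannon divergence as a stand-in distance. The function names (`distribution`, `js_divergence`, `probe`) and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def distribution(words):
    """Turn a list of association words into a probability distribution."""
    counts = Counter(w.strip().lower() for w in words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 = identical, 1 = disjoint."""
    vocab = set(p) | set(q)
    # Mixture distribution of p and q over the joint vocabulary.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a):
        return sum(a[w] * log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def probe(first_order, second_order_by_group):
    """Distance of each group forecast from the model's own direct answers."""
    base = distribution(first_order)
    return {
        group: js_divergence(base, distribution(words))
        for group, words in second_order_by_group.items()
    }


# Toy data (invented for illustration only).
first_order = ["butterfly", "bat", "butterfly", "mask"]
second_order = {
    "group_a": ["butterfly", "bat", "butterfly", "mask"],
    "group_b": ["flag", "eagle", "gun", "flag"],
}
scores = probe(first_order, second_order)
print(scores)
```

A forecast that scores near 1 for some group while the model's own direct answers barely vary is the kind of mismatch the bullet describes. Where to draw the line between "near 0" and "near 1" is a judgment call this sketch does not settle.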

Load-bearing premise

Abstract art and Rorschach blots lack pre-established cultural meanings, so differences between human and LLM responses can be attributed to hallucination rather than retrieval of learned associations.

What would settle it

Finding that LLMs' second-order stereotypes in these tasks closely match either their own first-order responses or the actual measured differences across human social groups would falsify the claim of stereotype hallucination.
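
One way to make this criterion concrete is to compare two between-group gaps: how far apart the groups actually are in first-order responses, and how far apart they are predicted to be in second-order responses. The sketch below is an assumption-laden illustration. It uses total variation distance over single-word responses rather than the measures named in the figure captions (entropy, classifier separability, semantic space), and `tv_distance`, `group_gaps` and the toy words are hypothetical.

```python
from collections import Counter


def tv_distance(words_a, words_b):
    """Total variation distance between two word-frequency distributions.

    0 means the two groups use words identically; 1 means no overlap.
    """
    ca, cb = Counter(words_a), Counter(words_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[w] / na - cb[w] / nb) for w in vocab)


def group_gaps(first_order, second_order):
    """Gap between two groups in actual responses vs. predicted responses."""
    (actual_a, actual_b), (pred_a, pred_b) = first_order, second_order
    return {
        "first_order_gap": tv_distance(actual_a, actual_b),
        "second_order_gap": tv_distance(pred_a, pred_b),
    }


# Toy data (invented): the groups answer alike, but predictions split them.
actual = (["bat", "moth", "bat", "mask"], ["bat", "mask", "moth", "bat"])
predicted = (["flower", "flower", "heart", "dress"], ["gun", "truck", "gun", "flag"])
gaps = group_gaps(actual, predicted)
print(gaps)
```

Under the claim as stated, a second-order gap that tracks the first-order gap (or the measured human gap) would count against stereotype hallucination, while a large second-order gap paired with a negligible first-order gap would be consistent with it.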

Figures

Figures reproduced from arXiv: 2606.30945 by Amir Goldberg, Douglas Guilbeault, Xinrui Chloe Zhao.

Figure 1: Experimental design comparing human and LLM associative processes across first-order and second-order tasks. The survey pipeline (top) shows human participants randomly assigned to either first-order tasks (n=300, providing direct interpretations of 4 random stimuli) or second-order tasks (n=300 per domain, predicting group interpretations for 2 random stimuli for one of the five social domains). The LLM p…
Figure 2: Classification methodology for measuring between-group separability of associative patterns. Word associations (3 objects, 3 concepts, 1 emotion) from each social domain by each respondent are fed into a standardized prompt asking LLM classifiers to determine group membership based solely on linguistic patterns. Both GPT-4o mini and Llama-3.2-11B-Vision-Instruct serve as independent classifiers to validate…
Figure 3: Entropy of word association distributions with bootstrap error estimation by order, type of association, domain, and type of respondent. This figure displays the entropy values (displayed on the x-axis) of association distributions for responses of human participants (blue circles, labeled as Human), GPT-4o mini (orange squares, labeled as GPT) and Llama-3.2-11B-Vision-Instruct (green diamonds, labeled as…
Figure 4: Semantic space analysis reveals LLM stereotype amplification. (A) …
Figure 5: Similarity analysis across domains for combined associations with fuzzy matching
Abstract

Recent studies argue that LLMs can predict human stereotypical judgments. Yet whether LLMs emulate the cognitive processes underlying human stereotypes, or merely retrieve learned associations to solve prediction tasks, remains unclear. Prior work examines LLMs' stereotypes in either (i) controlled judgment tasks like multiple choice surveys, or (ii) contexts constrained by conventionalized and predictable group biases. Here, we compare the structure of the stereotypes that humans and LLMs exhibit in the interpretation of free-form stimuli, namely abstract art and Rorschach blots, which lack pre-established cultural meanings. We recruit participants across five social domains (gender, partisanship, personality, urbanicity, and lifestyle) and elicit both first-order (direct personal interpretations) and second-order responses (predictions about how members of social groups will interpret the stimuli); we replicate this design with two multimodal models (GPT-4o mini and Llama-3.2-11B-Vision-Instruct). Humans and LLMs differ not only in magnitude but in the qualitative nature of their stereotypes. Human first-order responses display heterogeneity with minimal group structure. When predicting group responses, humans engage in "stereotype exaggeration" by moderately amplifying first-order tendencies while preserving diversity. By contrast, LLMs exhibit homogeneous first-order responses, and yet generate stark second-order stereotypes that neither amplify existing first-order tendencies nor reflect actual human group differences, a process we term "stereotype hallucination." LLMs continued to hallucinate stereotypes even when fine-tuned on the response data of actual participants. These findings suggest significant limitations in the use of LLMs to model and predict human behavior in novel contexts involving diverse interpretations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper compares human and LLM (GPT-4o mini, Llama-3.2-11B-Vision-Instruct) responses on free-form association tasks using abstract art and Rorschach blots across five domains (gender, partisanship, personality, urbanicity, lifestyle). Humans exhibit heterogeneous first-order interpretations and moderate 'stereotype exaggeration' in second-order group predictions; LLMs show homogeneous first-order responses yet generate stark second-order stereotypes unrelated to first-order patterns or actual human differences, termed 'stereotype hallucination,' which persists after fine-tuning on human data. The design uses stimuli argued to lack pre-established cultural meanings to isolate generative processes from retrieval.

Significance. If robust, the work offers a useful empirical distinction between human stereotype exaggeration and LLM stereotype hallucination in novel contexts, highlighting limits on LLMs as models of human social cognition. The free-form, non-conventionalized stimuli and direct first-/second-order comparison are strengths; the persistence after fine-tuning is a notable negative result.

major comments (3)
  1. [Introduction/Methods] The central distinction between 'stereotype hallucination' and retrieval of learned associations rests on the claim that abstract art and Rorschach blots 'lack pre-established cultural meanings.' No check for the presence of these exact images (or close visual analogs) in web-scale pretraining corpora, psychology literature, or art datasets is reported; without such verification the attribution of structured second-order outputs to hallucination rather than statistical regularities remains unestablished.
  2. [Methods] The abstract describes a comparative design with humans and two models across five domains but supplies no sample sizes, number of stimuli per participant, exact prompt wording, statistical tests for distribution comparisons, or controls for order/presentation effects. These details are load-bearing for evaluating whether the reported qualitative differences (homogeneous vs. heterogeneous first-order responses) are reliable.
  3. [Results, fine-tuning experiment] The claim that LLMs 'continued to hallucinate stereotypes even when fine-tuned on the response data of actual participants' is central, yet no details are given on fine-tuning dataset size, procedure, hyperparameters, or how second-order outputs were evaluated post-fine-tuning. This prevents assessment of whether the hallucination effect is robust or an artifact of insufficient adaptation.
minor comments (2)
  1. [Methods] Clarify operational definitions of 'first-order' and 'second-order' responses and how homogeneity vs. heterogeneity was quantified (e.g., entropy, variance metrics) in the Methods section.
  2. Figure captions and tables should report exact participant and stimulus counts, confidence intervals, and p-values for all human-LLM comparisons.
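
For readers unfamiliar with the entropy measure mentioned in the first minor comment and in the Figure 3 caption, the following is a minimal sketch of Shannon entropy over word associations with a percentile bootstrap interval. It is an illustration under assumptions, not the paper's procedure: the resampling scheme, the number of bootstrap draws, the function names and the toy responses are all hypothetical.

```python
import random
from collections import Counter
from math import log2


def shannon_entropy(words):
    """Plug-in Shannon entropy (bits) of a list of association words."""
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def bootstrap_entropy(words, n_boot=1000, seed=0):
    """Point estimate plus an approximate 95% percentile bootstrap interval.

    Note: plug-in entropy is biased downward in small samples, so the
    interval can sit below the point estimate for near-uniform data.
    """
    rng = random.Random(seed)
    estimates = sorted(
        shannon_entropy(rng.choices(words, k=len(words))) for _ in range(n_boot)
    )
    low = estimates[int(0.025 * n_boot)]
    high = estimates[int(0.975 * n_boot) - 1]
    return shannon_entropy(words), (low, high)


# Toy data (invented): one homogeneous and one heterogeneous response set.
homogeneous = ["bat"] * 8
heterogeneous = ["bat", "moth", "mask", "crab", "bat", "tree", "face", "moth"]

h_homog, ci_homog = bootstrap_entropy(homogeneous)
h_heter, ci_heter = bootstrap_entropy(heterogeneous)
print(h_homog, ci_homog)
print(h_heter, ci_heter)
```

Low entropy corresponds to the homogeneous responses attributed to the LLMs, and higher entropy to the heterogeneous responses attributed to human participants.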

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where additional clarity will strengthen the manuscript. We address each major comment below, indicating revisions where the manuscript will be updated to incorporate the feedback.

  1. Referee: [Introduction/Methods] The central distinction between 'stereotype hallucination' and retrieval of learned associations rests on the claim that abstract art and Rorschach blots 'lack pre-established cultural meanings.' No check for the presence of these exact images (or close visual analogs) in web-scale pretraining corpora, psychology literature, or art datasets is reported; without such verification the attribution of structured second-order outputs to hallucination rather than statistical regularities remains unestablished.

    Authors: We acknowledge that an exhaustive verification of these specific images across all pretraining corpora was not performed, as such a search is computationally prohibitive at web scale. The stimuli were selected from established psychological instruments (Rorschach blots) and abstract art explicitly chosen for their documented ambiguity and absence of fixed cultural referents in the relevant literature. We will revise the Methods section to expand the stimulus selection rationale, cite supporting references on the interpretive openness of these materials, and explicitly note the limitation regarding direct corpus checks while arguing that the free-form task design isolates generative processes from simple retrieval. revision: partial

  2. Referee: [Methods] The abstract describes a comparative design with humans and two models across five domains but supplies no sample sizes, number of stimuli per participant, exact prompt wording, statistical tests for distribution comparisons, or controls for order/presentation effects. These details are load-bearing for evaluating whether the reported qualitative differences (homogeneous vs. heterogeneous first-order responses) are reliable.

    Authors: The full manuscript contains these details, but we agree they should be more prominently and systematically presented. We will revise the Methods section to include explicit reporting of human sample sizes, stimuli counts per domain and participant, verbatim prompt templates, the specific statistical tests employed for comparing response distributions, and procedures used to counterbalance order and presentation effects. A new subsection on experimental controls will be added. revision: yes

  3. Referee: [Results, fine-tuning experiment] The claim that LLMs 'continued to hallucinate stereotypes even when fine-tuned on the response data of actual participants' is central, yet no details are given on fine-tuning dataset size, procedure, hyperparameters, or how second-order outputs were evaluated post-fine-tuning. This prevents assessment of whether the hallucination effect is robust or an artifact of insufficient adaptation.

    Authors: We agree that these implementation details are necessary for evaluating the fine-tuning result. We will expand the relevant Results subsection to report the exact size of the fine-tuning dataset, the full procedure (including data splitting and formatting), all hyperparameters used, and the evaluation protocol for post-fine-tuning second-order responses. This will clarify that the hallucination persisted under the reported adaptation conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of response distributions

The paper conducts an empirical study eliciting first- and second-order responses from humans and LLMs to abstract art and Rorschach blots, then compares the resulting distributions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim of 'stereotype hallucination' rests on observed differences in response structure rather than any reduction to prior inputs by construction. The assumption that stimuli lack pre-established meanings is stated as a design rationale but is not part of a derivation chain that collapses the result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that the chosen stimuli carry no pre-existing cultural associations and on standard experimental psychology assumptions about response elicitation; no free parameters or independently evidenced invented entities are introduced.

axioms (1)
  • domain assumption: Abstract art and Rorschach blots lack pre-established cultural meanings.
    This premise is invoked to justify that observed differences reflect internal processes rather than retrieval of known associations.
invented entities (1)
  • stereotype hallucination (no independent evidence)
    purpose: Label for the process in which LLMs produce second-order stereotypes unrelated to first-order responses or human data.
    New descriptive term introduced for the observed mismatch; no independent falsifiable prediction or external evidence is supplied in the abstract.


