arxiv: 2511.11435 · v3 · submitted 2025-11-14 · 💻 cs.CV · cs.AI

The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models

Pith reviewed 2026-05-17 21:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diffusion models · text-to-image generation · multimodal iconicity · cultural references · memorization evaluation · CRT metric · recognition and realization · prompt perturbation

The pith

Diffusion models respond to culturally iconic prompts through separate processes of recognizing a reference and then realizing it via replication or reinterpretation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how text-to-image diffusion models handle prompts that draw on shared cultural visual references, such as famous artworks or film scenes. It introduces the Cultural Reference Transformation metric to measure whether a model evokes the reference at all and then how it chooses to depict it. Tests across five models and 767 references show clear differences: some models recognize fewer references while others copy visuals more closely. Prompt changes like synonym swaps still lead models to reproduce iconic structures, and recognition ties to training data frequency along with reference popularity and age. The work argues that evaluation must move past simple text-image matching to capture this contextual behavior.

Core claim

The behavior of diffusion models in culturally iconic settings cannot be reduced to simple reproduction but depends on how references are recognized and realized. The Cultural Reference Transformation metric isolates these two dimensions and reveals model-specific patterns when applied to 767 Wikidata-derived references spanning still and moving imagery. Recognition further correlates with training data frequency, textual uniqueness, reference popularity, and creation date, while prompt perturbations demonstrate persistent reproduction of iconic visual structures.

What carries the argument

The Cultural Reference Transformation (CRT) metric, which separates recognition of a cultural reference from its realization through replication or reinterpretation.
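The two dimensions can be sketched in a few lines, assuming image and patch embeddings have already been extracted. The figure captions on this page name CLIP for recognition (threshold τ = 0.7) and DINOv3 patch analysis for realization (reuse threshold τ_reuse = 0.6); everything else below is an assumption for illustration. In particular, the function names are hypothetical, and `transformation_score` is only one plausible way to combine the two signals, not the paper's formula, which this page does not reproduce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recognition_rate(gen_embs, ref_embs, tau=0.7):
    """Fraction of generated images whose best match to any reference reaches tau."""
    hits = sum(1 for g in gen_embs if max(cosine(g, r) for r in ref_embs) >= tau)
    return hits / len(gen_embs)

def patch_reuse(gen_patches, ref_patches, tau_reuse=0.6):
    """Fraction of generated patches that closely match some reference patch."""
    reused = sum(
        1 for p in gen_patches
        if max(cosine(p, q) for q in ref_patches) >= tau_reuse
    )
    return reused / len(gen_patches)

def transformation_score(recognized, reuse):
    """ASSUMED combination: high only when the reference is recognized
    and few patches are reused; zero when it is not recognized."""
    return (1.0 - reuse) if recognized else 0.0

# Toy 2-D "embeddings" stand in for real CLIP / DINOv3 features.
images = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.1]]
print(recognition_rate(images, references))   # one of two images is recognized
```

Under this assumed combination, an image that is recognized but built almost entirely from reused patches scores near zero, which is at least consistent with the per-figure values reported further down (for example CRA = 1.0 with CRT = 0.00).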

If this is right

  • Some diffusion models exhibit weaker recognition of cultural references while others rely more on visual replication when handling iconic prompts.
  • Models continue to reproduce iconic visual structures even after textual cues are altered through synonym substitutions or literal image descriptions.
  • Recognition strength correlates with training data frequency, textual uniqueness, reference popularity, and creation date.
  • Evaluation of text-to-image models must account for both recognition and realization rather than relying on basic matching metrics alone.
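The perturbation finding rests on comparing CRT before and after a prompt change, restricted to references that stay recognized in both conditions (the Figure 7 caption describes this subset and its 95% confidence intervals). A minimal sketch of that bookkeeping follows. The function name, the input layout, and the normal-approximation interval are assumptions, since the page does not say how the intervals were computed.

```python
import math

def delta_crt_retained(crt_before, crt_after, rec_before, rec_after):
    """Mean CRT change over references recognized both before and after
    a perturbation, with a normal-approximation 95% CI half-width.

    Returns (mean, half_width, n), or None when no reference is retained.
    half_width is None when fewer than two references are retained.
    """
    deltas = [
        after - before
        for before, after, rb, ra in zip(crt_before, crt_after, rec_before, rec_after)
        if rb and ra
    ]
    n = len(deltas)
    if n == 0:
        return None
    mean = sum(deltas) / n
    if n < 2:
        return mean, None, n
    variance = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    half_width = 1.96 * math.sqrt(variance / n)                # assumed CI method
    return mean, half_width, n

# Hypothetical scores for four references; only the first two stay recognized.
result = delta_crt_retained(
    [0.2, 0.5, 0.9, 0.4],
    [0.4, 0.5, 0.1, 0.8],
    [True, True, True, False],
    [True, True, False, True],
)
print(result)
```

Filtering on retained recognition keeps the comparison from being dominated by references the model simply stops recognizing, which would otherwise read as a change in realization.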

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be applied to other generative systems to detect similar cultural grounding patterns in outputs.
  • Training procedures might explicitly reward reinterpretation over replication to reduce unwanted visual copying of popular references.
  • Future benchmarks could include controlled variations in reference popularity to isolate its effect on model behavior.

Load-bearing premise

The 767 Wikidata-derived cultural references form a representative sample of multimodal iconicity and the CRT metric isolates recognition from realization without model-specific or reference-specific confounds.

What would settle it

Running the same evaluation on a fresh set of cultural references matched for frequency but drawn from non-iconic sources and finding identical recognition rates across models would indicate the patterns are not driven by cultural specificity.

Figures

Figures reproduced from arXiv: 2511.11435 by Eva Cetinic, Maria-Teresa De Rosa Palmini.

Figure 1. Example generations from Stable Diffusion XL for culturally iconic references like …
Figure 2. Recognition vs. Transformation in Multimodal Iconicity. Generations from three diffusion models illustrate how they respond to the prompt “The Dark Side of the Moon.” The vertical axis (Recognition) indicates whether the model evokes the intended cultural reference, while the horizontal axis (Transformation) reflects the degree of visual reinterpretation. … multimodal iconicity. Such iconic pairs are charac…
Figure 3. Framework for evaluating multimodal iconicity. Cultural reference prompts generate images evaluated along two dimensions: Recognition (CRA), measuring alignment with reference images via CLIP, and Realization (VI), measuring how independently the model recreates them using DINOv3 patch analysis. The resulting Cultural Reference Transformation (CRT) metric captures both a model’s ability to identify cultura…
Figure 4. CRA–VR relationship across diffusion models for static (top) and dynamic (bottom) cultural references. Each point corresponds …
Figure 5. Example of images generated from the prompt Physical Graffiti using three diffusion models, SDXL, SD3, and Flux Schnell, shown alongside the iconic cultural reference image. …onym variant) introduces minimal prompt change by replacing a key content word of the iconic title with a semantically close synonym (e.g., "The Shriek" for "The Scream"). The second (literal description) replaces the title with an …
Figure 6. Qualitative examples from the prompt perturbation experiments. For both static ( …
Figure 7. ∆CRT, retained recognition. Mean change in Cultural Reference Transformation (CRT) after textual perturbations (synonym and description) computed over the subset of cultural references that remained recognized before and after perturbation. Error bars denote 95% confidence intervals. … largest share across both variants, indicating that diffusion models can reproduce iconic visuals despite altered ling…
Figure 8. Strongest correlates of CRA as a function of the number of deduplicated text–image pairs. Each scatterplot shows how CRA varies with creation date, image memorability, and text uniqueness (static and dynamic) as a function of the number of deduplicated text–image pairs. Points are colored by CRA, and median splits along both axes define quadrants annotated with average CRA values.
Figure 9. Examples of low text uniqueness cultural references with no alignment in SD v2.1. Iconic references (top) and corresponding generations (bottom). All shown references have text uniqueness below 0.1 and exhibit near-zero CRA. … retrieve memorized data points in diffusion models [32]. Text concreteness also shows a weak positive correlation (ρ = 0.16), suggesting a minor trend in which more abstract titles hin…
Figure 11.
Figure 12. Modality Distribution. Concepts span six major cultural domains, ensuring balanced representation across artistic and media forms. …ities between reference images belonging to the same cultural reference versus different ones. The resulting distributions were well separated (µ_same ≈ 0.85, µ_diff ≈ 0.47). Setting τ = 0.7 retained approximately 96% of true matches while keeping false positives below 1%, e…
Figure 13. CLIP similarity distributions. Cosine similarities between reference images of the same (blue) and different (red) cultural references. The separation supports the choice of τ = 0.7 for recognition alignment.
Figure 14. DINOv3 similarity distributions. Cosine similarities between reference images of the same (blue) and unrelated (red) cultural references. The observed separation motivates the reuse threshold τ_reuse = 0.6 used for patch-level analysis. … τ_reuse = 0.6 achieved F1 ≈ 0.98 with a false-positive rate below 0.1, effectively distinguishing local replication from unrelated variation. This value aligns with the repl…
Figure 15. PDFE overestimates replication while CRT reveals cultural transformation. Compositional coherence stems from learned iconic structure rather than direct visual reuse (Napoleon Crossing the Alps, Sacred and Profane Love, The Big Bang Theory).
Figure 16. PDFE underestimates cultural alignment: low PDFE despite high CRA/CRT. Models reproduce canonical iconography through transformation rather than replication (Saint Jerome in the Wilderness, American Gothic, The Walking Dead).
Figure 17. PDFE indicates moderate replication while CRA reveals lack of cultural alignment. Models generate superficially similar scenes without capturing the intended cultural reference (Madonna with the Long Neck, Portrait of Père Tanguy, Lost in Translation).
Figure 19. Image-level distribution of visual reuse ( …
Figure 18. CRA–CRC relationship across models for dynamic cultural references. References are grouped by CRA bins (1.0, 0.9, 0.8, etc.) and plotted against their average CRC values. … F. Distribution of Visual Reuse
Figure 20. Prompt: Pillars of Creation. Models: Imagen 4 (left), Flux Schnell (center), SD3 (right). CRA: Imagen 4 = 1.0; Flux Schnell = 0.0; SD3 = 1.0. CRT: Imagen 4 = 0.00; Flux Schnell = 0.00; SD3 = 0.73.
Figure 21. Prompt: Nighthawks. Models: Imagen 4 (left), Flux Schnell (center), SDXL (right). CRA: Imagen 4 = 1.0; Flux Schnell = 0.0; SDXL = 1.0. CRT: Imagen 4 = 0.88; Flux Schnell = 0.00; SDXL = 0.19.
Figure 22. Prompt: Lady with an Ermine. Models: Flux Schnell (left), Imagen 4 (center), SDXL (right). CRA: Flux Schnell = 0.0; Imagen 4 = 1.0; SDXL = 1.0. CRT: Flux Schnell = 0.00; Imagen 4 = 0.87; SDXL = 0.68.
Figure 23. Prompt: Breakfast at Tiffany’s. Models: SDXL (left), Imagen 4 (center), Flux Schnell (right). CRA: SDXL = 1.0; Imagen 4 = 1.0; Flux Schnell = 1.0. CRT: SDXL = 0.98; Imagen 4 = 0.32; Flux Schnell = 0.96.
Figure 24. Prompt: House of Cards. Models: Flux Schnell (left), SD3 (center), SDXL (right). CRA: Flux Schnell = 0.0; SD3 = 1.0; SDXL = 1.0. CRT: Flux Schnell = 0.00; SD3 = 0.49; SDXL = 0.95.
Figure 25. Prompt: Breaking Bad. Models: Imagen 4 (left), SDXL (center), SD2 (right). CRA: Imagen 4 = 1.0; SDXL = 1.0; SD2 = 1.0. CRT: Imagen 4 = 0.59; SDXL = 0.87; SD2 = 0.21.
Figure 26. Change in Cultural Reference Alignment (∆CRA) under prompt perturbations. Mean change in ∆CRA under synonym substitutions (solid bars) and literal descriptions (hatched bars), shown separately for static (a) and dynamic (b) references. … J. Examples of Residual Duplicates in LAION
Figure 27. Examples of derivative reproductions of The Starry Night found in LAION. Even after near-duplicate removal, visually varied but semantically redundant products (e.g., shirts, mugs, posters) remain. … K. Correlation Analysis As shown in Tab. 5, both static and dynamic concepts exhibit significant correlations between Cultural Reference Alignment (CRA) and several cultural and training-data–related features…
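The captions for Figures 12 to 14 justify the thresholds τ = 0.7 and τ_reuse = 0.6 by comparing similarity scores for same-reference pairs against different-reference pairs. The sketch below shows one way such a threshold could be chosen from two lists of cosine similarities. The function names, the candidate grid, and the selection rule (lowest threshold within a false-positive budget) are assumptions, not the authors' procedure.

```python
def threshold_stats(same_sims, diff_sims, tau):
    """Return (share of same-reference pairs kept, false-positive rate) at tau."""
    retained = sum(s >= tau for s in same_sims) / len(same_sims)
    false_pos = sum(s >= tau for s in diff_sims) / len(diff_sims)
    return retained, false_pos

def pick_threshold(same_sims, diff_sims, candidates, max_fpr=0.01):
    """Lowest candidate whose false-positive rate stays within max_fpr.

    Retention can only fall as tau rises, so the lowest admissible
    candidate also keeps the most true matches among the candidates.
    """
    for tau in sorted(candidates):
        retained, fpr = threshold_stats(same_sims, diff_sims, tau)
        if fpr <= max_fpr:
            return tau, retained, fpr
    return None  # no candidate meets the false-positive budget

# Hypothetical similarity samples, not values from the paper.
same = [0.90, 0.85, 0.80, 0.65]
diff = [0.50, 0.45, 0.72, 0.30]
print(threshold_stats(same, diff, 0.7))
print(pick_threshold(same, diff, [0.6, 0.7, 0.8], max_fpr=0.0))
```

With real data the two input lists would hold pairwise similarities between reference images, as in the distributions plotted in Figures 13 and 14.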
Original abstract

The ambiguity between generalization and memorization in TTI diffusion models becomes pronounced when prompts invoke culturally shared visual references, a phenomenon we term multimodal iconicity. These are instances in which images and texts reflect established cultural associations, such as when a title recalls a familiar artwork or film scene. Such cases challenge existing approaches to evaluating memorization, as they define a setting in which instance-level memorization and culturally grounded generalization are structurally intertwined. To address this challenge, we propose an evaluation framework to assess a model's ability to remain culturally grounded without relying on visual replication. Specifically, we introduce the Cultural Reference Transformation (CRT) metric, which separates two dimensions of model behavior: Recognition, whether a model evokes a reference, from Realization, how it depicts it through replication or reinterpretation. We evaluate five diffusion models on 767 Wikidata-derived cultural references, covering both still and moving imagery, and find differences in how they respond to multimodal iconicity: some show weaker recognition, while others rely more heavily on replication. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, we find that cultural reference recognition correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our findings show that the behavior of diffusion models in culturally iconic settings cannot be reduced to simple reproduction, but depends on how references are recognized and realized, advancing evaluation beyond simple text-image matching toward richer contextual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines multimodal iconicity in text-to-image diffusion models, where prompts invoke culturally shared visual references (e.g., artworks or film scenes). It introduces the Cultural Reference Transformation (CRT) metric to separate Recognition (whether the model evokes the reference) from Realization (replication versus reinterpretation), evaluates five models on 767 Wikidata-derived references, conducts synonym-substitution and literal-description prompt perturbations, and reports correlations between recognition and factors including training-data frequency, textual uniqueness, reference popularity, and creation date. The central claim is that model behavior under cultural iconicity cannot be reduced to simple reproduction but depends on these separable dimensions.

Significance. If the CRT metric validly isolates the two dimensions without model- or reference-specific confounds, the work strengthens evaluation of generative models by incorporating cultural grounding and contextual understanding beyond standard text-image similarity. The scale of the reference set, use of external Wikidata grounding, and perturbation experiments supply a reproducible empirical basis for comparing models on memorization-versus-generalization trade-offs.

major comments (2)
  1. [§3 (CRT Metric)] Recognition and Realization are both scored from visual features of the identical generated images. This risks circularity because any model-specific generation bias or prompt sensitivity directly couples the two scores, undermining the claim that models differ independently on these dimensions and that prompt-perturbation results cleanly isolate linguistic sensitivity.
  2. [§4 (Evaluation and Results)] The manuscript reports model differences, perturbation outcomes, and correlations with popularity/date but supplies no quantitative tables, error analysis, baseline comparisons, or statistical tests for the CRT scores. Without these, the magnitude and reliability of the reported differences cannot be assessed.
minor comments (2)
  1. [§3.1] Clarify the exact similarity thresholds, feature extractors, or decision rules used to operationalize Recognition versus Realization in the CRT definition.
  2. [Discussion] Add a limitations paragraph addressing potential selection bias in the 767 Wikidata references and any reference-specific confounds in the metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which help strengthen the presentation of our evaluation framework. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [§3 (CRT Metric)] Recognition and Realization are both scored from visual features of the identical generated images. This risks circularity because any model-specific generation bias or prompt sensitivity directly couples the two scores, undermining the claim that models differ independently on these dimensions and that prompt-perturbation results cleanly isolate linguistic sensitivity.

    Authors: We appreciate the concern regarding potential coupling. Recognition is operationalized via binary detection of presence/absence of a small set of reference-specific visual attributes drawn from Wikidata (e.g., distinctive objects, composition, or color palette), while Realization is scored separately as a continuous deviation measure using a perceptual hash distance focused on overall visual fidelity to the source image. These use distinct feature sets and decision thresholds, allowing the two dimensions to vary independently across models and prompts. The perturbation experiments further support separability by showing that synonym changes often reduce Recognition scores while leaving Realization largely unchanged. We will expand §3 with explicit scoring formulas, attribute lists, and an ablation demonstrating that the two scores are not deterministically linked. revision: yes

  2. Referee: [§4 (Evaluation and Results)] The manuscript reports model differences, perturbation outcomes, and correlations with popularity/date but supplies no quantitative tables, error analysis, baseline comparisons, or statistical tests for the CRT scores. Without these, the magnitude and reliability of the reported differences cannot be assessed.

    Authors: We agree that the current version lacks sufficient quantitative detail for readers to evaluate effect sizes and statistical reliability. The revised manuscript will include: (1) a table of mean CRT Recognition and Realization scores per model with standard deviations and 95% confidence intervals; (2) an error analysis breaking down false-positive and false-negative cases by reference type (artwork vs. film scene); (3) baseline comparisons against CLIP cosine similarity and LPIPS; and (4) Pearson correlations with p-values for the reported factors (training-data frequency, textual uniqueness, popularity, creation date). These additions will be placed in §4 and the appendix. revision: yes
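The correlation analysis promised in the second response can be illustrated with a small, dependency-free sketch. The rebuttal mentions Pearson correlations, while a figure excerpt on this page reports a coefficient as ρ, so both Pearson and a rank-based (Spearman) coefficient are shown. The function names and the sample values are hypothetical, p-values are left out, and inputs are assumed to have non-zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-reference values: training-pair counts vs. recognition scores.
pair_counts = [1, 2, 3, 4]
recognition = [1, 4, 9, 16]
print(pearson(pair_counts, recognition), spearman(pair_counts, recognition))
```

A rank-based coefficient is less sensitive to the heavy-tailed frequency counts typical of web-scale training data, which is one reason to report it alongside Pearson.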

Circularity Check

0 steps flagged

No significant circularity; empirical metric on external references

The paper is an empirical evaluation study that introduces the CRT metric to separate recognition and realization on 767 Wikidata-derived cultural references. It relies on external data sources, prompt perturbation experiments, and generated image assessments rather than any derivation chain, fitted parameters, or self-referential equations that reduce reported outcomes to quantities defined within the paper itself. No load-bearing steps match the enumerated circularity patterns, and the central claims remain independent of internal self-definition or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the validity of the newly introduced CRT metric and the representativeness of the Wikidata cultural reference set; both are postulated in the paper without independent prior validation or falsifiable handles outside the current study.

axioms (1)
  • [domain assumption] Wikidata entries provide an unbiased and representative sample of culturally shared visual references
    Used to construct the 767-reference evaluation set covering still and moving imagery.
invented entities (2)
  • multimodal iconicity [no independent evidence]
    purpose: To name the structural intertwining of instance-level memorization and culturally grounded generalization in prompts
    Term introduced to frame the evaluation challenge; no independent evidence supplied.
  • Cultural Reference Transformation (CRT) metric [no independent evidence]
    purpose: To separate Recognition from Realization dimensions of model behavior on cultural references
    Newly proposed evaluation framework; no prior validation or external benchmark mentioned.

pith-pipeline@v0.9.0 · 5577 in / 1565 out tokens · 51824 ms · 2026-05-17T21:49:40.745477+00:00 · methodology

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 3 internal anchors

  1. [1] Stability AI. Stable Diffusion v2.1 and DreamStudio update. https://stability.ai/blog/stablediffusion2-1-release7-dec-2022,

  2. [2] Accessed: November 4, 2025.

  3. [3] Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1493–15...

  4. [4] Black Forest Labs. Flux.1-schnell. https://huggingface.co/black-forest-labs/FLUX.1-schnell, 2024. Model card.

  5. [5] Jonathan Brokman, Omer Hofman, Roman Vainshtein, Amit Giloni, Toshiya Shimizu, Inderjeet Singh, Oren Rachmil, Alon Zolfi, Asaf Shabtai, Yuki Unno, et al. Montrage: Monitoring training for attribution of generative diffusion models. In European Conference on Computer Vision, pages 1–17. Springer, 2024.

  6. [6] Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911, 2014.

  7. [7] Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, and Florian Tramer. The privacy onion effect: Memorization is relative. Advances in Neural Information Processing Systems, 35:13263–13276,

  8. [8] Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, 2023.

  9. [9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

  10. [10] Eva Cetinic. The myth of culturally agnostic AI models. arXiv preprint arXiv:2211.15271, 2022.

  11. [11] Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alex Dimakis, and Adam Klivans. Ambient diffusion: Learning clean distributions from corrupted data. Advances in Neural Information Processing Systems, 36:288–313, 2023.

  12. [12] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The Faiss library. IEEE Transactions on Big Data, 2025.

  13. [13] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning,

  14. [14] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2426–2436, 2023.

  15. [15] Google DeepMind. Imagen 4. https://deepmind.google/models/imagen/, 2025. Model overview.

  16. [16] Robert Hariman and John Louis Lucaites. No caption needed: Iconic photographs, public culture, and liberal democracy. University of Chicago Press, 2007.

  17. [17] Harry H Jiang, Lauren Brown, Jessica Cheng, Mehtab Khan, Abhishek Gupta, Deja Workman, Alex Hanna, Johnathan Flowers, and Timnit Gebru. AI art and its impact on artists. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 363–374, 2023.

  18. [18] Cody Kommers, Ruth Ahnert, Maria Antoniak, Emmanouil Benetos, Steve Benford, Mercedes Bunz, Baptiste Caramiaux, Shauna Concannon, Martin Disley, James Dobson, et al. Computational hermeneutics: Evaluating generative AI as a cultural technology. 2025.

  19. [19] Sasha Luccioni, Christopher Akiki, Margaret Mitchell, and Yacine Jernite. Stable bias: Evaluating societal representations in diffusion models. Advances in Neural Information Processing Systems, 36:56338–56351, 2023.

  20. [20] Coen D Needell and Wilma A Bainbridge. Embracing new techniques in deep learning for estimating image memorability. Computational Brain & Behavior, 5(2):168–184, 2022.

  21. [21] Maria-Teresa De Rosa Palmini and Eva Cetinic. Synthetic history: Evaluating visual representations of the past in diffusion models. arXiv preprint arXiv:2505.17064, 2025.

  22. [22] David D Perlmutter. Photojournalism and foreign policy: Icons of outrage in international crises. 1998.

  23. [23] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022.

  24. [24] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  25. [25]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

  26. [26]

    Unveiling and mitigating mem- orization in text-to-image diffusion models through cross at- tention

    Jie Ren, Yaxin Li, Shenglai Zeng, Han Xu, Lingjuan Lyu, Yue Xing, and Jiliang Tang. Unveiling and mitigating mem- orization in text-to-image diffusion models through cross at- tention. InEuropean Conference on Computer Vision, pages 340–356. Springer, 2024. 2

  27. [27]

    The computational mem- orability of iconic images.Proceedings http://ceur-ws

    Lisa Saleh and Nanne van Noord. The computational mem- orability of iconic images.Proceedings http://ceur-ws. org ISSN, 1613:0073, 2022. 2

  28. [28]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 7

  29. [29]

    Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 35:25278–25294, 2022. 7

  30. [30]

    Beyond aesthetics: Cultural competence in text-to-image models

    Nithish Kannen Senthilkumar, Arif Ahmad, Marco Andreetto, Vinodkumar Prabhakaran, Utsav Prabhu, Adji Bousso Dieng, Pushpak Bhattacharyya, and Shachi Dave. Beyond aesthetics: Cultural competence in text-to-image models. Advances in Neural Information Processing Systems, 37:13716–13747, 2024. 1

  31. [31]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 4

  32. [32]

    Diffusion art or digital forgery? investigating data replication in diffusion models

    Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6048–6058, 2023. 1, 2, 4, 7

  33. [33]

    Understanding and mitigating copying in diffusion models

    Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Understanding and mitigating copying in diffusion models. Advances in Neural Information Processing Systems, 36:47783–47803, 2023. 7, 8

  34. [34]

    Intrinsically memorable words have unique associations with their meanings

    Greta Tuckute, Kyle Mahowald, Phillip Isola, Aude Oliva, Edward Gibson, and Evelina Fedorenko. Intrinsically memorable words have unique associations with their meanings. Journal of Experimental Psychology: General, 2025. 7

  35. [35]

    Mapping the latent spaces of culture

    Ted Underwood. Mapping the latent spaces of culture. 2021. 2

  36. [36]

    The iconicity of the generated image

    Nanne van Noord and Noa Garcia. The iconicity of the generated image. arXiv preprint arXiv:2509.16473, 2025. 2

  37. [37]

    Navigating cultural chasms: Exploring and unlocking the cultural pov of text-to-image models

    Mor Ventura, Eyal Ben-David, Anna Korhonen, and Roi Reichart. Navigating cultural chasms: Exploring and unlocking the cultural pov of text-to-image models. Transactions of the Association for Computational Linguistics, 13:142–166,

  38. [38]

    Wikidata: a free collaborative knowledgebase

    Denny Vrandečić and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, 2014. 2, 3

  39. [39]

    Evaluating data attribution for text-to-image models

    Sheng-Yu Wang, Alexei A Efros, Jun-Yan Zhu, and Richard Zhang. Evaluating data attribution for text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7192–7203, 2023. 1, 2

  40. [40]

    Image copy detection for diffusion models

    Wenhao Wang, Yifan Sun, Zhentao Tan, and Yi Yang. Image copy detection for diffusion models. Advances in Neural Information Processing Systems, 37:14417–14456, 2024. 1, 2, 4

  41. [41]

    On the de-duplication of laion-2b

    Ryan Webster, Julien Rabin, Loic Simon, and Frederic Jurie. On the de-duplication of laion-2b. arXiv preprint arXiv:2303.12733, 2023. 2

  42. [42]

    Toward an evaluation science for generative ai systems

    Laura Weidinger, Inioluwa Deborah Raji, Hanna Wallach, Margaret Mitchell, Angelina Wang, Olawale Salaudeen, Rishi Bommasani, Deep Ganguli, Sanmi Koyejo, and William Isaac. Toward an evaluation science for generative ai systems. arXiv preprint arXiv:2503.05336, 2025. 2

  43. [43]

    Detecting, explaining, and mitigating memorization in diffusion models

    Yuxin Wen, Yuchen Liu, Chen Chen, and Lingjuan Lyu. Detecting, explaining, and mitigating memorization in diffusion models. In The Twelfth International Conference on Learning Representations, 2024. 2

  44. [44]

    Counterfactual memorization in neural language models

    Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. Counterfactual memorization in neural language models. Advances in Neural Information Processing Systems, 36:39321–39362,

  45. [45]

    Forget-me-not: Learning to forget in text-to-image diffusion models

    Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024. 2

  46. [46]

    Defensive unlearning with adversarial training for robust concept erasure in diffusion models

    Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, and Sijia Liu. Defensive unlearning with adversarial training for robust concept erasure in diffusion models. Advances in neural information processing systems, 37:36748–36776, 2024. 2