arxiv: 2604.07338 · v1 · submitted 2026-04-08 · cs.CV · cs.CL · cs.MM

Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3

classification cs.CV · cs.CL · cs.MM
keywords vision-language models · cultural metadata · cross-cultural benchmark · heritage images · structured inference · LLM evaluation · image understanding

The pith

Vision-language models capture only fragmented signals when inferring structured cultural metadata from images, and their performance varies widely across cultural regions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a multi-category benchmark to test how vision-language models extract structured details such as creator, origin, and period from cultural heritage images. It evaluates model outputs against expert reference annotations using an automated semantic alignment scorer and measures exact, partial, and attribute-level accuracy across different cultural regions. The central finding is that models rely on scattered visual cues rather than coherent cultural knowledge, producing predictions that shift sharply depending on the culture and the metadata category involved. A reader would care because accurate automated metadata could scale the organization of digital heritage collections, yet the observed inconsistencies show that current models lack the grounding needed for reliable cross-cultural use.

Core claim

The paper presents Appear2Meaning, a cross-cultural benchmark for structured cultural metadata inference from images. Using an LLM-as-Judge framework to measure semantic alignment with reference annotations, it reports exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.

What carries the argument

The Appear2Meaning benchmark, which supplies images tagged by cultural region and requires models to infer multiple structured metadata attributes, scored via LLM-based semantic alignment against expert references.
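
A minimal sketch of how the three reported metrics could be aggregated, assuming the judge returns one boolean verdict per attribute. The attribute names and the metric definitions used here (exact match means every attribute is judged aligned, partial match means at least one is) are illustrative assumptions, not the paper's specification.

```python
from typing import Dict, List

# Hypothetical attribute set; the benchmark's actual schema may differ.
ATTRIBUTES = ("title", "culture", "period", "creator")


def score_objects(verdicts: List[Dict[str, bool]]) -> Dict[str, object]:
    """Aggregate per-attribute judge verdicts into three accuracy figures.

    Each element of `verdicts` maps an attribute name to True when the judge
    considered the prediction semantically aligned with the reference.
    """
    n = len(verdicts)
    if n == 0:
        raise ValueError("no verdicts to score")
    # Assumed definition: exact match requires every attribute to be aligned.
    exact = sum(all(v[a] for a in ATTRIBUTES) for v in verdicts)
    # Assumed definition: partial match requires at least one aligned attribute
    # (objects that match exactly are therefore counted here as well).
    partial = sum(any(v[a] for a in ATTRIBUTES) for v in verdicts)
    per_attr = {a: sum(v[a] for v in verdicts) / n for a in ATTRIBUTES}
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": per_attr,
    }


# Toy verdicts for four objects (made up for illustration).
example = [
    {"title": True, "culture": True, "period": True, "creator": True},
    {"title": True, "culture": True, "period": False, "creator": False},
    {"title": False, "culture": False, "period": False, "creator": False},
    {"title": True, "culture": False, "period": False, "creator": True},
]
result = score_objects(example)
```

Separating the object-level figures from the attribute-level ones makes it visible when a model gets, say, the culture right but the period wrong, which is the kind of fragmentation the paper describes.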

If this is right

  • Models must move beyond isolated visual cues to build integrated cultural reasoning if they are to handle structured metadata tasks consistently.
  • Large gaps in accuracy by region and attribute type indicate that training data and architectures need targeted adjustments for underrepresented cultures.
  • Automated heritage metadata systems built on current VLMs will produce unreliable results until the observed inconsistencies are addressed.
  • Attribute-level breakdowns can serve as diagnostic tools to identify which metadata categories are most in need of model improvement.
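
The regional gaps and attribute-level diagnostics in the list above could be surfaced with a simple grouped breakdown. The sketch below assumes a flat record layout of (region, attribute, aligned) and uses placeholder region names; neither comes from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, bool]  # (region, attribute, judged aligned)


def accuracy_by_region(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Return accuracy for every (region, attribute) pair seen in the records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for region, attribute, aligned in records:
        totals[(region, attribute)] += 1
        hits[(region, attribute)] += int(aligned)
    return {key: hits[key] / totals[key] for key in totals}


def largest_gap(table: Dict[Tuple[str, str], float], attribute: str):
    """Return (best_region, worst_region, gap) for a single attribute."""
    scores = {r: acc for (r, a), acc in table.items() if a == attribute}
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]


# Placeholder data: two hypothetical regions, two attributes.
records = [
    ("region_A", "culture", True), ("region_A", "culture", True),
    ("region_A", "period", True), ("region_A", "period", False),
    ("region_B", "culture", True), ("region_B", "culture", False),
    ("region_B", "period", False), ("region_B", "period", False),
]
table = accuracy_by_region(records)
gap = largest_gap(table, "period")
```

Ranking attributes by their best-to-worst regional gap is one way to decide which metadata category to examine first.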

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark format could be reused to test whether similar fragmentation appears in other multimodal tasks that require historical or contextual knowledge.
  • Persistent regional differences may reflect imbalances in the visual and textual data used to train existing vision-language models.
  • Pairing the automated judge with periodic human expert checks could reduce the risk that evaluation itself embeds cultural skew.

Load-bearing premise

The LLM-as-Judge framework reliably measures semantic alignment with reference annotations without introducing its own cultural or interpretive biases.

What would settle it

A follow-up study in which human cultural experts directly rate the same model outputs, producing accuracy scores that either agree with the LLM judge results or diverge sharply from them across the full set of cultures and metadata types.
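
One way such a comparison could be quantified is with Cohen's kappa between expert verdicts and judge verdicts on the same items. The sketch below is a plain standard-library implementation under the assumption that both raters assign one categorical label per item; the toy labels are made up, and the paper does not report this analysis.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the raters labelled independently at their own rates.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[lab] * judge_freq[lab] for lab in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy verdicts (1 = aligned, 0 = not aligned) for eight model outputs.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1]
judge_labels = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(human_labels, judge_labels)
```

Computing kappa separately for each cultural region would show whether the judge tracks expert ratings equally well everywhere or only for some cultures.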

Figures

Figures reproduced from arXiv: 2604.07338 by Enze Zhang, Konstantinos Arvanitis, Md Mohsinul Kabir, Qianqian Xie, Sophia Ananiadou, Stavroula Golfomitsou, Yuechen Jiang.

Figure 1: Cultural heritage objects from four regions are used to evaluate vision-language models on structured metadata inference, with predictions …
Figure 2: Data curation pipeline combining rule-based filtering and …
Figure 3: Example images for Object ID 1055_Butter Pat.
Figure 4: Example images for Object ID 1513_Celery vase.
Figure 5: Example images for Object ID 42_Andiron.
Figure 6: Example images for Object ID 0f097d4a-4ca1-40fd-b562-ab41a411aff1.
Figure 7: Example images for Object ID 333_Basin.
  Ground Truth:
  • Title: Basin
  • Culture: Chinese
  • Period: 1825–45
  • Creator: Unknown
  Representative Predictions:
  • Qwen3-VL-Plus: Cantonese export porcelain with Eight Immortals; Qing dynasty Guangxu period (1875–1908)
  • GPT-5.4-mini: Chinese porcelain basin, possibly Qing dynasty workshop
  • Claude Haiku 4.5: Decorative porcelain bowl, East Asian tradition
  Analysis: …
Figure 8: Example images for Object ID 2b6e224c-686a-4b43-aa5a-1ef5520ef0ef. Top: full painting. Bottom: object back (left) and framed view (right).

Original abstract

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Appear2Meaning, a multi-category cross-cultural benchmark for inferring structured cultural metadata (e.g., creator, origin, period) from images. It evaluates VLMs via an LLM-as-Judge framework that scores semantic alignment with reference annotations, reporting exact-match, partial-match, and attribute-level accuracy across cultural regions and metadata types. The central claim is that current VLMs capture only fragmented signals, exhibit substantial performance variation, and produce inconsistent, weakly grounded predictions.

Significance. If the evaluation is robust, the benchmark would be a useful contribution to computer vision and cultural AI by providing a structured testbed for cultural reasoning capabilities beyond standard captioning. The cross-cultural design and multi-level accuracy metrics could help surface limitations in VLMs for heritage applications.

major comments (2)
  1. [LLM-as-Judge framework and Experiments] The headline claims about fragmented signals and cultural performance gaps rest entirely on LLM-as-Judge semantic alignment scores. No calibration data, human inter-annotator agreement, or judge-model ablation is reported, leaving open the possibility that the judge's own training distribution introduces systematic cultural bias that inflates the observed gaps (see the LLM-as-Judge framework description and the Experiments section).
  2. [Abstract and Results] The abstract states that results show 'substantial performance variation' but supplies neither dataset size, the list of evaluated VLMs, nor any numerical accuracy values. Without these, the magnitude and statistical reliability of the cross-cultural claims cannot be assessed (see Results section and any accompanying tables).
minor comments (1)
  1. [Abstract] The abstract would benefit from one or two concrete accuracy numbers or dataset scale figures to give readers an immediate sense of the effect sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive referee report. We address each major comment below and indicate the revisions we intend to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [LLM-as-Judge framework and Experiments] The headline claims about fragmented signals and cultural performance gaps rest entirely on LLM-as-Judge semantic alignment scores. No calibration data, human inter-annotator agreement, or judge-model ablation is reported, leaving open the possibility that the judge's own training distribution introduces systematic cultural bias that inflates the observed gaps (see the LLM-as-Judge framework description and the Experiments section).

    Authors: We agree that validating the LLM-as-Judge framework is important for supporting the central claims. The manuscript describes the prompting approach and semantic alignment scoring but does not report calibration data, inter-annotator agreement, or model ablations. In the revision we will expand the Experiments section with a new subsection on judge robustness. This will include a discussion of the specific judge model used, consistency checks via repeated prompting on a sample of cases, and explicit acknowledgment of possible cultural biases arising from the judge's training data. We will also add a Limitations paragraph noting that comprehensive human inter-annotator agreement (IAA) and full ablations were not performed in the current study and constitute valuable future work. The reference annotations themselves were produced by regional cultural experts, and the judge is prompted strictly to measure alignment with those references rather than to perform independent cultural inference.
    revision: partial

  2. Referee: [Abstract and Results] The abstract states that results show 'substantial performance variation' but supplies neither dataset size, the list of evaluated VLMs, nor any numerical accuracy values. Without these, the magnitude and statistical reliability of the cross-cultural claims cannot be assessed (see Results section and any accompanying tables).

    Authors: We thank the referee for highlighting this presentational issue. While the Results section and tables already contain the dataset size, the full list of evaluated VLMs, and the numerical accuracy values (exact-match, partial-match, and attribute-level) that demonstrate the reported variation, the abstract itself remains high-level. In the revised version we will update the abstract to incorporate these key quantitative details so that readers can immediately gauge the scale and reliability of the cross-cultural performance gaps without first consulting the full paper.
    revision: yes
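
The consistency check via repeated prompting mentioned in the first response could look roughly like the following. The `judge` callable is a hypothetical stand-in for whatever model call the authors use, and the stub below only simulates an unstable verdict so that the sketch runs without any external service.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (model prediction, reference annotation)


def consistency_rate(cases: List[Case],
                     judge: Callable[[str, str], str],
                     repeats: int = 5) -> float:
    """Fraction of cases whose verdict is identical across all repeated calls."""
    if not cases:
        raise ValueError("no cases to check")
    stable = 0
    for prediction, reference in cases:
        verdicts = [judge(prediction, reference) for _ in range(repeats)]
        if len(set(verdicts)) == 1:
            stable += 1
    return stable / len(cases)


def make_stub_judge() -> Callable[[str, str], str]:
    """Hypothetical judge: flips its verdict on the 'ambiguous' prediction."""
    calls = {"n": 0}

    def judge(prediction: str, reference: str) -> str:
        calls["n"] += 1
        if prediction == "ambiguous":
            # Alternate verdicts on successive calls to mimic sampling noise.
            return "aligned" if calls["n"] % 2 == 0 else "not_aligned"
        return "aligned" if prediction == reference else "not_aligned"

    return judge


cases = [("Chinese", "Chinese"), ("ambiguous", "Chinese"), ("Japanese", "Chinese")]
rate = consistency_rate(cases, make_stub_judge(), repeats=4)
```

A low rate on a sample would suggest that single-pass judge scores are too noisy to support fine-grained regional comparisons. Note that this measures stability only, not whether the judge agrees with human experts.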

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with external annotations

Full rationale

The paper introduces a cross-cultural benchmark for structured metadata inference and evaluates VLMs via LLM-as-Judge against reference annotations using exact-match, partial-match, and attribute-level accuracy. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on direct comparison to external human annotations rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for the central results. This is a standard empirical evaluation setup with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the accuracy of newly created reference annotations and the validity of the LLM-as-Judge evaluation protocol.

axioms (2)
  • Domain assumption: reference annotations constitute accurate ground truth for cultural metadata attributes.
    All accuracy metrics are computed by direct comparison to these annotations.
  • Domain assumption: the LLM-as-Judge produces unbiased semantic alignment scores.
    The framework is used to quantify exact-match, partial-match, and attribute-level accuracy.

pith-pipeline@v0.9.0 · 5446 in / 1188 out tokens · 59379 ms · 2026-05-10T17:45:18.868042+00:00 · methodology



ACM MM ’26, November 10–14, 2026, Rio de Janeiro, Brazil. Jiang et al.

A Case Studies and Error Analysis

We analyze prediction outputs across models and identif...