
arxiv: 2606.08959 · v1 · pith:2HQDPD6Hnew · submitted 2026-06-08 · 💻 cs.CV · cs.CL

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

Pith reviewed 2026-06-27 17:15 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords ChinaHeritaQA · visual question answering · vision-language models · cultural reasoning · UNESCO World Heritage · Chinese heritage sites · multimodal benchmark

The pith

ChinaHeritaQA shows that vision-language models beat humans on average on Chinese heritage questions but struggle with culturally grounded reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChinaHeritaQA as a benchmark dataset of 2,279 images from Chinese UNESCO World Heritage sites paired with 14,133 bilingual multiple-choice questions. The questions cover seven cognitive dimensions ranging from basic recognition to historical periodization and architectural analysis, all guided by a UNESCO-aligned ontology and checked through human annotation. Evaluations of current models indicate they surpass human performance overall yet show clear weaknesses on tasks that require cultural and historical context, with results differing by dynasty and region. A reader would care because the work isolates a specific limitation: visual processing strength does not automatically produce cultural understanding.

Core claim

ChinaHeritaQA is a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models ex

What carries the argument

The ChinaHeritaQA dataset, which supplies images of heritage sites together with multiple-choice questions across seven cognitive dimensions created via a UNESCO-aligned ontology and human verification.
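
To make the shape of one benchmark item concrete, here is a minimal sketch of how an image-linked bilingual multiple-choice question could be represented. The seven question-type names come from the Q1 to Q7 labels in the paper's figure captions; every field name, the class names, and the sample values are assumptions for illustration, not the released schema.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    # Names follow the Q1-Q7 labels used in the paper's figure captions.
    Q1 = "Identity Recognition"
    Q2 = "Visual Grounding"
    Q3 = "Description Matching"
    Q4 = "Historical Periodization"
    Q5 = "Historical Contextualization"
    Q6 = "Functional Analysis"
    Q7 = "Architectural Analysis"


@dataclass(frozen=True)
class QAItem:
    # Hypothetical field names; the released dataset may use a different schema.
    image_id: str
    question_type: QuestionType
    question_zh: str
    question_en: str
    options_zh: tuple[str, ...]
    options_en: tuple[str, ...]
    answer_index: int  # position of the correct option in both option tuples

    def __post_init__(self) -> None:
        # Bilingual items only make sense if the two option lists line up.
        if len(self.options_zh) != len(self.options_en):
            raise ValueError("Chinese and English option lists must align")
        if not 0 <= self.answer_index < len(self.options_en):
            raise ValueError("answer_index is out of range")

    def answer(self, lang: str = "en") -> str:
        """Return the correct option text in the requested language."""
        options = self.options_en if lang == "en" else self.options_zh
        return options[self.answer_index]


# Placeholder item with made-up content, not taken from the dataset.
sample = QAItem(
    image_id="img_0001",
    question_type=QuestionType.Q4,
    question_zh="占位问题",
    question_en="Placeholder question",
    options_zh=("选项甲", "选项乙"),
    options_en=("Option A", "Option B"),
    answer_index=1,
)
```

Keeping a single `answer_index` shared by both languages is one way to guarantee that the Chinese and English versions of an item cannot drift to different answer keys.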

If this is right

  • State-of-the-art vision-language models exceed average human performance on the overall benchmark.
  • Models perform strongly on visual recognition questions but weakly on questions that require cultural and historical reasoning.
  • Model accuracy varies by the dynasty and geographic region of the heritage sites.
  • Strong performance on visual retrieval tasks does not carry over to cultural and historical understanding tasks.
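
The task-level comparison behind these points can be sketched as a per-question-type Macro-F1, the metric named in the figure captions. This is only an illustration: it assumes Macro-F1 is taken over answer-option labels within each question type, which the page does not confirm, and the function names and record layout are hypothetical.

```python
from collections import defaultdict


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in gold."""
    labels = sorted(set(gold))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def macro_f1_by_type(records):
    """records: iterable of (question_type, gold_option, predicted_option)."""
    grouped = defaultdict(lambda: ([], []))
    for qtype, gold, pred in records:
        grouped[qtype][0].append(gold)
        grouped[qtype][1].append(pred)
    return {qtype: macro_f1(g, p) for qtype, (g, p) in grouped.items()}


# Made-up predictions, only to show the call shape.
demo = [
    ("Q1", "A", "A"),
    ("Q1", "B", "B"),
    ("Q4", "A", "B"),
    ("Q4", "B", "B"),
]
scores = macro_f1_by_type(demo)
```

Reporting one score per question type, rather than a single pooled score, is what lets a strong recognition result and a weak periodization result show up side by side.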

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could serve as a template for creating similar benchmarks focused on heritage sites in other countries to test whether the same visual-versus-cultural gap appears elsewhere.
  • Training approaches that add explicit historical timelines or regional context might close the observed performance gap on dynasty-specific questions.
  • The variation across dynasties suggests that models may need separate handling of temporal cultural shifts rather than treating all heritage content uniformly.

Load-bearing premise

The 14,133 QA pairs accurately and consistently measure culturally grounded reasoning without introducing annotator bias or factual inconsistencies.

What would settle it

A re-annotation round by independent cultural experts that produces substantially different ground-truth answers on the historical periodization or architectural analysis questions would show the dataset does not reliably measure cultural reasoning.
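
One way such a check could be quantified is the share of items, per question type, whose answer key changes after re-annotation. The sketch below assumes both annotation rounds are available as item-to-answer mappings; the function name, argument names, and sample values are hypothetical.

```python
from collections import Counter


def key_change_rate(original: dict, reannotated: dict, qtype_of: dict) -> dict:
    """Share of items per question type whose answer key changed."""
    total, changed = Counter(), Counter()
    # Only items present in both rounds can be compared.
    for item_id in original.keys() & reannotated.keys():
        qtype = qtype_of[item_id]
        total[qtype] += 1
        if original[item_id] != reannotated[item_id]:
            changed[qtype] += 1
    return {qtype: changed[qtype] / total[qtype] for qtype in total}


# Made-up keys, only to show the call shape.
rates = key_change_rate(
    original={"a": "A", "b": "B", "c": "C"},
    reannotated={"a": "A", "b": "C", "c": "C"},
    qtype_of={"a": "Q4", "b": "Q4", "c": "Q7"},
)
```

A high change rate concentrated in Q4 or Q7 would point at the ground truth for those dimensions, rather than at the models, as the source of the reported gap.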

Figures

Figures reproduced from arXiv: 2606.08959 by Anna-Carolina Haensch, Bolei Ma, Chengyan Wu, Daniel Hershcovich, Yi Zhang, Yong Cao.

Figure 1: The distribution of World Cultural Heritage …
Figure 2: The overall construction pipeline of ChinaHeritaQA. The framework consists of two main phases.
Figure 3: The chronological distribution of QA pairs in …
Figure 4: Geographical heatmap of QA pairs. The data is mostly concentrated in central and southwestern China, particularly in Shanxi (1,812 pairs) and Chongqing (1,565 pairs). Shanxi has many well-preserved ancient wooden buildings and grottoes, while Chongqing features unique landscapes and cultural sites. Both generate high user interest and abundant photos on social media. In contrast, many regions have very li…
Figure 5: F1 comparison across question types in Chi…
Figure 7: Mean Macro-F1 across dynasties, grouped into four capability types. VLMs show region-level cultural grounding bias
Figure 9: Distribution of wrong-answer types for Q2 …
Figure 10: Distribution of wrong-answer dynasties for …
Figure 8: Mean Macro-F1 across seven macro-regions, …
Figure 11: System prompts for VLMs in Chinese and English. 45.4% and a Fleiss' κ of 0.592. The agreement levels vary significantly across categories, reflecting the varying difficulty of the cultural heritage tasks. Evaluators demonstrated high consistency in identity recognition (Q1), visual grounding (Q2), description matching (Q3), and functional analysis (Q6), where Fleiss' κ scores exceeded 0.62. In contrast,…
Figure 12: The instruction interface provided to hu…
Figure 13: A representative failure case of Q2 Visual …
Figure 14: Province-level Macro-F1 for five VLMs (CogVLM2 excluded), sorted by the Macro F1 column (cross …
Figure 15: Representative example of Q1 Identity Recognition in ChinaHeritaQA. The left side shows the visual …
Figure 16: Representative example of Q2 Visual Grounding in ChinaHeritaQA. The model is given a heritage-site …
Figure 17: Representative example of Q3 Description Matching in ChinaHeritaQA. The model must choose the …
Figure 18: Representative example of Q4 Historical Periodization in ChinaHeritaQA. The model must infer the …
Figure 19: Representative example of Q5 Historical Contextualization in ChinaHeritaQA. The model must select …
Figure 20: Representative example of Q6 Functional Analysis in ChinaHeritaQA. The model must infer the main …
Figure 21: Representative example of Q7 Architectural Analysis in ChinaHeritaQA. The model must identify the …
Original abstract

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces ChinaHeritaQA, a multimodal VQA benchmark with 2,279 in-the-wild images of Chinese UNESCO World Heritage sites paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions from basic recognition to historical periodization and architectural analysis. Constructed via a UNESCO-aligned heritage ontology and human annotation, the dataset is used to evaluate state-of-the-art VLMs, which are reported to exceed average human performance yet exhibit substantial task-level variation, excelling at visual recognition but struggling with culturally grounded reasoning, with further variation by dynasty and region.

Significance. If the dataset construction and evaluation hold, the work supplies a needed resource for probing cultural and historical reasoning gaps in VLMs beyond visual retrieval, with potential to guide development of more culturally aware multimodal models. The bilingual design and ontology grounding are positive features for cross-lingual and heritage-specific evaluation.

major comments (2)
  1. [dataset construction and annotation description] The abstract and dataset description claim the 14,133 QA pairs were 'verified through rigorous human annotation' ensuring 'linguistic quality and factual consistency,' yet no inter-annotator agreement scores, error rates, or validation statistics are supplied. This directly affects the central claim that the benchmark isolates culturally grounded reasoning from visual recognition and annotator bias.
  2. [evaluation and results section] The headline evaluation result (top models exceed humans on average but fail on cultural reasoning, with dynasty/region variation) rests on the assumption that the seven cognitive dimensions and UNESCO ontology operationalize cultural nuance without systematic factual drift or regional skew in the bilingual items; no error analysis or ontology breakdown is provided to support this separation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on ChinaHeritaQA. The comments highlight important aspects of dataset validation and evaluation that we will address in the revision. Below we respond point by point.

  1. Referee: The abstract and dataset description claim the 14,133 QA pairs were 'verified through rigorous human annotation' ensuring 'linguistic quality and factual consistency,' yet no inter-annotator agreement scores, error rates, or validation statistics are supplied. This directly affects the central claim that the benchmark isolates culturally grounded reasoning from visual recognition and annotator bias.

    Authors: We agree that quantitative validation statistics strengthen the claims. The manuscript describes a multi-stage annotation process with heritage experts and consensus resolution, but does not report agreement metrics. In the revised version we will add an appendix with inter-annotator agreement (Cohen's kappa) computed from retained annotation logs, error rates from the verification stage, and a breakdown of resolved disagreements. This addition will directly support the isolation of cultural reasoning from annotator effects. revision: yes

  2. Referee: The headline evaluation result (top models exceed humans on average but fail on cultural reasoning, with dynasty/region variation) rests on the assumption that the seven cognitive dimensions and UNESCO ontology operationalize cultural nuance without systematic factual drift or regional skew in the bilingual items; no error analysis or ontology breakdown is provided to support this separation.

    Authors: We accept that an explicit error analysis and ontology distribution would better substantiate the separation of visual versus cultural reasoning. The current results section reports aggregate and per-dimension scores plus dynasty/region variation, but lacks per-ontology breakdowns and qualitative error examples. The revised manuscript will include (1) a table showing question distribution across the UNESCO-aligned ontology categories and (2) a qualitative error analysis highlighting cases where models succeed on recognition but fail on historical or architectural reasoning. These additions will clarify the operationalization of cultural nuance. revision: yes
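
The agreement statistic promised in the first response can be sketched as follows. This is a plain two-rater Cohen's kappa, assuming each annotator's choices are stored as equal-length lists of option labels; the function name and sample labels are hypothetical, and with more than two annotators a multi-rater statistic such as Fleiss' κ would be the usual choice instead.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations, only to show the call shape.
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Computing the statistic separately for each of the seven question types, rather than once over all items, would show whether agreement drops on the culturally grounded dimensions.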

Circularity Check

0 steps flagged

No circularity: dataset creation and empirical evaluation only


The paper introduces a new VQA dataset (ChinaHeritaQA) with 2,279 images and 14,133 QA pairs, describes its construction via a UNESCO-aligned ontology and human annotation, and reports empirical VLM evaluations. No mathematical derivations, fitted parameters, predictions, or uniqueness theorems appear. All load-bearing claims rest on the dataset itself and external model benchmarks rather than reducing to self-citations or input quantities by construction. This matches the default non-circular case for dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As an empirical dataset introduction paper, the work contains no mathematical free parameters, background axioms, or newly postulated entities; the contribution rests on data collection, ontology alignment, and human annotation rather than formal assumptions.

pith-pipeline@v0.9.1-grok · 5704 in / 1251 out tokens · 33324 ms · 2026-06-27T17:15:50.765219+00:00 · methodology

