pith. machine review for the scientific record.

arxiv: 2604.11632 · v1 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3

classification 💻 cs.CL
keywords Chinese art · vision-language models · benchmark · authenticity discrimination · art understanding · CURATORQA · connoisseurship · cultural heritage

The pith

Vision-language models post high overall scores on Chinese art questions yet drop sharply on evidence linking, expert-style appreciation, and authenticity discrimination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CARTBENCH, a museum-sourced benchmark that tests VLMs on Chinese artworks through four progressive subtasks: evidence-grounded recognition and reasoning, structured expert-style captions, rated reinterpretations, and diagnostic authenticity checks against visually similar confounds. Evaluation of nine representative models shows that strong aggregate accuracy on the recognition task conceals large failures on harder subtasks such as style-to-period inference and evidence chaining. Long-form appreciation outputs remain distant from expert references, while authenticity discrimination hovers near chance. A sympathetic reader would care because these gaps indicate that current models still lack the connoisseur-level visual and interpretive skills required for reliable cultural-heritage applications.

Core claim

CARTBENCH comprises CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for four-section expert-style appreciation, REINTERPRET for defensible reinterpretation scored against expert ratings, and CONNOISSEURPAIRS for authenticity discrimination under similar visual confounds. The benchmark is constructed by aligning Wikidata image-bearing Palace Museum objects with authoritative catalog pages across five art categories and multiple dynasties. Across nine VLMs, high CURATORQA accuracy masks sharp performance drops on hard evidence linking and style-to-period inference; long-form appreciation stays far from expert references; and authenticity discrimination remains near chance.
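
To make the construction concrete: below is a minimal sketch of what the Phase 1 retrieval step could look like against the public Wikidata SPARQL endpoint. The paper reports that this initial query yields 127,601 image-bearing items linked to the Palace Museum, but its exact query is not reproduced here, so the museum QID below is a placeholder, not the value the authors used.

    # Hedged sketch of Phase 1 (Wikidata retrieval), not the authors' code.
    # MUSEUM_QID is a placeholder: look up the Palace Museum item on wikidata.org.
    import requests

    WDQS = "https://query.wikidata.org/sparql"
    MUSEUM_QID = "Q_PLACEHOLDER"  # hypothetical stand-in for the real item ID

    QUERY = f"""
    SELECT ?item ?itemLabel ?image WHERE {{
      ?item wdt:P195 wd:{MUSEUM_QID} .   # P195 = collection
      ?item wdt:P18 ?image .             # P18 = image, i.e. require image availability
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,zh". }}
    }}
    LIMIT 100
    """

    resp = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "cartbench-sketch/0.1"})
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["itemLabel"]["value"], row["image"]["value"])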

What carries the argument

The CARTBENCH benchmark with its four museum-grounded subtasks that escalate from basic recognition to connoisseur-level authenticity discrimination.

If this is right

  • Aggregate benchmark scores are insufficient to certify expert-level cultural reasoning in VLMs.
  • Models require explicit training on evidence chaining and stylistic period mapping to close the observed gaps.
  • Long-form generation must be evaluated against expert writing norms rather than generic fluency metrics.
  • Authenticity tasks near chance imply that current visual representations do not capture subtle diagnostic features used by human connoisseurs; a quick chance-level check is sketched below.
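
To make "near chance" operational: on a two-way forced-choice task such as CONNOISSEURPAIRS, one can test whether observed accuracy is statistically distinguishable from 50%. A minimal sketch using scipy; the item and correct counts are hypothetical placeholders, not numbers from the paper.

    # Hypothetical counts -- the paper's per-task figures are not reproduced here.
    from scipy.stats import binomtest

    n_items = 400     # placeholder CONNOISSEURPAIRS item count
    n_correct = 212   # placeholder number of correct authenticity calls
    result = binomtest(n_correct, n_items, p=0.5, alternative="two-sided")
    print(f"accuracy = {n_correct / n_items:.3f}, p = {result.pvalue:.3f}")
    # A large p-value means the accuracy is consistent with coin-flip guessing.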

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Parallel benchmarks for other national art traditions could reveal whether the same pattern of superficial accuracy holds cross-culturally.
  • Training data augmented with explicit expert annotations on subtle stylistic cues might improve performance on the harder subtasks.
  • Museums considering VLM tools for cataloging or visitor guidance should add human oversight specifically for authenticity and reinterpretation outputs.

Load-bearing premise

The alignment of Wikidata objects with Palace Museum catalog pages and expert ratings supplies reliable, unbiased ground truth for reinterpretation and authenticity tasks.

What would settle it

A fresh round of expert grading of the REINTERPRET and CONNOISSEURPAIRS items would settle it: if re-graded authenticity accuracy rose well above chance, or if the model ranking on the hard subtasks reversed, the reported limitations would be falsified.

Figures

Figures reproduced from arXiv: 2604.11632 by Hidetaka Kamigaito, Hongyao Li, Taro Watanabe, Xuan Zhou, Xuefeng Wei, Yusuke Sakai, Zhi Qu, Zhixuan Wang.

Figure 1
Figure 1: Overview of CARTBENCH construction and task instantiation. Top: Phase 1 retrieves image-bearing Palace Museum objects from Wikidata, Phase 2 aligns them to official catalog pages to collect curatorial descriptions, and Phase 3 performs expert filtering and category assignment to yield museum-grounded artwork–appreciation pairs. Bottom: the curated pairs are instantiated into four tasks: CURATORQA (evidence … view at source ↗
Figure 2
Figure 2: Type and era distributions of CURATORQA entries: (left) five art categories; (right) top-8 merged eras. view at source ↗
Figure 3
Figure 3: Survey Questionnaire for REINTERPRET-Part1. view at source ↗
Figure 4
Figure 4: Survey Questionnaire for REINTERPRET-Part2. view at source ↗
Figure 5
Figure 5: Survey Questionnaire for CONNOISSEURPAIRS-Part1. view at source ↗
Figure 6
Figure 6: Survey Questionnaire for CONNOISSEURPAIRS-Part2. view at source ↗
Figure 7
Figure 7: Chinese dynasties and historical periods considered in this work, listing the Chinese names with English translations and corresponding time spans. view at source ↗
read the original abstract

We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CArtBench, a museum-grounded benchmark for VLMs on Chinese art with four subtasks (CURATORQA for evidence-grounded recognition/reasoning, CATALOGCAPTION for structured expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination). Built via Wikidata-to-Palace Museum alignment across five categories and dynasties, evaluation of nine VLMs shows high CURATORQA accuracy masking drops on hard evidence linking and style-to-period inference, long-form appreciation far from expert references, and near-chance authenticity discrimination.

Significance. If the ground-truth labels hold, this benchmark is significant for exposing limitations in current VLMs on connoisseur-level cultural reasoning beyond standard QA, providing a reproducible framework for Chinese art interpretation that could guide future model development in heritage domains.

major comments (2)
  1. [Dataset Construction] The construction of REINTERPRET and CONNOISSEURPAIRS (via Wikidata alignment to Palace Museum catalog pages) lacks any reported inter-annotator agreement, mismatch rates, or quantitative audit of label quality for expert ratings and attributions; this directly undermines the validity of the headline claims on defensible reinterpretation and near-chance authenticity discrimination.
  2. [Evaluation and Results] No dataset size, statistical significance tests, or exact model versions are reported in the evaluation of the nine VLMs, making it impossible to verify the claimed sharp drops on hard subtasks or the near-chance authenticity results.
minor comments (2)
  1. [Abstract] The abstract would benefit from including the total number of images or questions per subtask and a brief note on the nine models evaluated.
  2. [Results] Consider adding a table summarizing inter-subtask performance correlations or baseline comparisons to prior art benchmarks; a minimal correlation sketch follows below.
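
On the second minor comment: one simple way to build such a table is to compute Spearman rank correlations of per-model scores between subtask pairs. A hedged sketch; the nine-model score vectors are hypothetical stand-ins, since the paper's per-model numbers are not reproduced here.

    # Hypothetical per-model scores for the nine VLMs, one vector per subtask.
    from itertools import combinations
    from scipy.stats import spearmanr

    scores = {
        "CURATORQA":        [0.81, 0.74, 0.69, 0.66, 0.62, 0.60, 0.55, 0.51, 0.48],
        "CATALOGCAPTION":   [0.42, 0.40, 0.35, 0.37, 0.30, 0.31, 0.26, 0.24, 0.22],
        "REINTERPRET":      [0.55, 0.50, 0.52, 0.44, 0.41, 0.43, 0.36, 0.33, 0.30],
        "CONNOISSEURPAIRS": [0.54, 0.51, 0.49, 0.52, 0.50, 0.48, 0.51, 0.49, 0.50],
    }
    for a, b in combinations(scores, 2):
        rho, p = spearmanr(scores[a], scores[b])
        print(f"{a} vs {b}: rho = {rho:+.2f} (p = {p:.3f})")

A weak CONNOISSEURPAIRS correlation with the other columns would quantify the review's central point: aggregate recognition accuracy does not predict connoisseur-level performance.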

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript introducing CArtBench. We address each of the major comments below and commit to revisions that will enhance the clarity and rigor of the paper.

read point-by-point responses
  1. Referee: [Dataset Construction] The construction of REINTERPRET and CONNOISSEURPAIRS (via Wikidata alignment to Palace Museum catalog pages) lacks any reported inter-annotator agreement, mismatch rates, or quantitative audit of label quality for expert ratings and attributions; this directly undermines the validity of the headline claims on defensible reinterpretation and near-chance authenticity discrimination.

    Authors: We appreciate the referee pointing out the need for greater transparency in dataset construction. The alignments for REINTERPRET and CONNOISSEURPAIRS are based on direct mappings from Wikidata to the authoritative Palace Museum catalog pages, which provide the expert attributions and ratings. While we did not include inter-annotator agreement metrics in the original submission (as the primary labels derive from museum experts rather than multiple annotators), we acknowledge this as a gap. In the revised version, we will add a dedicated section on dataset quality, including mismatch rates from the alignment process, any available agreement statistics from the source catalogs, and a quantitative audit of the expert ratings used. This will strengthen the validity of our claims regarding reinterpretation and authenticity discrimination. revision: yes

  2. Referee: [Evaluation and Results] No dataset size, statistical significance tests, or exact model versions are reported in the evaluation of the nine VLMs, making it impossible to verify the claimed sharp drops on hard subtasks or the near-chance authenticity results.

    Authors: We agree that these details are crucial for reproducibility and verification of results. Although the full manuscript includes dataset sizes in the 'Dataset Construction' section (e.g., number of instances per subtask and dynasty), we will move and explicitly highlight these figures in the 'Evaluation and Results' section. We will also incorporate statistical significance tests, such as paired t-tests or bootstrap methods appropriate for the task metrics (a paired-bootstrap sketch follows below), to confirm the observed drops on hard subtasks and the near-chance performance on authenticity. Finally, we will specify the exact model versions and checkpoints used for all nine VLMs evaluated. These additions will make the results fully verifiable. revision: yes
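
A minimal sketch of the paired bootstrap the rebuttal commits to: resample items with replacement and count how often the accuracy gap between two models vanishes. All per-item scores below are synthetic placeholders, not the paper's data.

    # Synthetic per-item 0/1 correctness for two models scored on the same items.
    import numpy as np

    rng = np.random.default_rng(0)
    n_items = 500
    model_a = rng.random(n_items) < 0.62   # placeholder accuracy ~0.62
    model_b = rng.random(n_items) < 0.55   # placeholder accuracy ~0.55

    observed_gap = model_a.mean() - model_b.mean()
    n_boot, reversals = 10_000, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n_items, n_items)   # paired resample of item indices
        if model_a[idx].mean() - model_b[idx].mean() <= 0:
            reversals += 1
    print(f"gap = {observed_gap:.3f}, one-sided bootstrap p ~= {reversals / n_boot:.4f}")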

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and VLM evaluation with no derivations or self-referential reductions

full rationale

The paper introduces CARTBENCH by aligning existing external sources (Wikidata objects with Palace Museum catalog pages) and performs direct empirical evaluation of nine VLMs across four subtasks (CURATORQA, CATALOGCAPTION, REINTERPRET, CONNOISSEURPAIRS). No equations, fitted parameters, or predictions are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The central claims are observational results from model performance against the constructed labels, which are externally sourced rather than internally derived. This is a standard self-contained benchmark paper with no load-bearing steps that match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that museum catalog alignments and expert ratings constitute valid ground truth; no free parameters, mathematical axioms, or new entities are introduced.

axioms (1)
  • domain assumption Expert ratings and authoritative catalog pages provide reliable ground truth for art appreciation and authenticity tasks
    Invoked in construction of REINTERPRET and CONNOISSEURPAIRS subtasks and in interpreting model performance against expert references.

pith-pipeline@v0.9.0 · 5482 in / 1259 out tokens · 37654 ms · 2026-05-10T15:17:24.530599+00:00 · methodology

discussion (0)

