M-ArtAgent: Evidence-Based Multimodal Agent for Implicit Art Influence Discovery

Hanyi Liu; Heran Yang; Minghao Wang; Yuhang Xie; Zhonghao Jiu

arXiv: 2604.07468 · v1 · submitted 2026-04-08 · 💻 cs.AI

Pith reviewed 2026-05-10 18:19 UTC · model grok-4.3

classification 💻 cs.AI

keywords: implicit art influence · multimodal agent · art attribution · evidence-based reasoning · ReAct protocol · style analysis · iconographic retrieval · influence benchmark

The pith

M-ArtAgent reframes implicit art influence discovery as probabilistic adjudication using a four-phase evidence protocol.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an agent that treats undocumented artistic influences as a problem of building and testing verifiable evidence chains rather than measuring visual similarity alone. It follows a controlled sequence of investigation from images and biographies, corroboration against historical axioms, adversarial falsification by a separate critic, and final verdict. Specialized operators ground style comparisons in formal analysis and iconographic retrieval in established classification systems to keep intermediate steps auditable. On a balanced set of 100 artists and 2000 directed influence pairs, the approach yields strong detection performance that holds after explicit influence phrases are masked. This shows that domain-constrained verification can improve attribution reliability over pattern matching or unguided language model output.

Core claim

M-ArtAgent assembles evidence chains from images and biographies under art-historical axioms, subjects each hypothesis to prompt-isolated adversarial falsification, and reaches 83.7 percent positive-class F1, 0.666 Matthews correlation coefficient, and 0.910 ROC-AUC on the WIB-100 benchmark; these gains remain after leakage controls and phrase masking, establishing that historically grounded adjudication outperforms embedding similarity or unguided multimodal output for implicit influence attribution.
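
As a quick arithmetic check, the headline figures are mutually consistent with the fold-averaged recall (86.0%) and specificity (80.5%) quoted in the Table VII excerpt under Figures. The sketch below assumes, purely for illustration, that the balanced benchmark splits into 1,000 positive and 1,000 negative pairs and that the averaged rates can be applied to pooled counts; neither assumption is stated in the source.

```python
import math

# Assumptions (not from the paper): a 1,000 / 1,000 class split and
# fold-averaged rates applied directly to pooled counts.
N_POS, N_NEG = 1000, 1000
RECALL, SPECIFICITY = 0.860, 0.805

tp = round(RECALL * N_POS)        # true positives
fn = N_POS - tp                   # false negatives
tn = round(SPECIFICITY * N_NEG)   # true negatives
fp = N_NEG - tn                   # false positives


def f1(hits: int, false_alarms: int, misses: int) -> float:
    """F1 for one class, written directly in terms of counts."""
    return 2 * hits / (2 * hits + false_alarms + misses)


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else 0.0


pos_f1 = f1(tp, fp, fn)            # positive class
neg_f1 = f1(tn, fn, fp)            # negative class: roles swapped
macro_f1 = (pos_f1 + neg_f1) / 2
mcc_value = mcc(tp, fp, tn, fn)

print(f"positive-class F1 = {pos_f1:.3f}")   # 0.837
print(f"macro F1          = {macro_f1:.3f}")  # 0.832
print(f"MCC               = {mcc_value:.3f}")  # 0.666
```

Under these assumptions the 83.7% positive-class F1 and the 83.2% macro F1 are two views of the same confusion matrix rather than conflicting results. ROC-AUC cannot be recovered this way because it depends on the full score ranking.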

What carries the argument

Four-phase protocol (Investigation, Corroboration, Falsification, Verdict) run by a ReAct-style controller that deploys StyleComparator for formal style analysis and ConceptRetriever for ICONCLASS iconographic grounding to produce auditable claims.
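
The control flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the phase names and the operator names come from the paper, while the data classes, the scoring rule, the stub operators, and the temporal axiom are all hypothetical (the source does not list its axioms).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Evidence:
    source: str    # operator that produced the claim
    claim: str     # auditable, human-readable statement
    weight: float  # hypothetical support score in [0, 1]


@dataclass
class Hypothesis:
    influencer: str
    influenced: str
    influencer_active_from: int   # toy metadata for the example axiom
    influenced_active_until: int
    evidence: List[Evidence] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


def temporal_axiom(h: Hypothesis) -> bool:
    """Illustrative axiom (an assumption): the influencer must already be
    active before the influenced artist's career ends."""
    return h.influencer_active_from <= h.influenced_active_until


def run_protocol(
    h: Hypothesis,
    investigate: Callable[[Hypothesis], List[Evidence]],
    axioms: List[Callable[[Hypothesis], bool]],
    critic: Callable[[List[str]], Set[int]],
) -> float:
    # Phase 1, Investigation: gather candidate evidence.
    h.evidence = investigate(h)
    h.log.append(f"investigation: {len(h.evidence)} claims")

    # Phase 2, Corroboration: every axiom must hold.
    for axiom in axioms:
        if not axiom(h):
            h.log.append(f"corroboration: {axiom.__name__} violated")
            return 0.0
    h.log.append("corroboration: all axioms hold")

    # Phase 3, Falsification: the critic sees only the claim texts,
    # a stand-in for prompt isolation, and returns indices it rejects.
    rejected = critic([e.claim for e in h.evidence])
    h.log.append(f"falsification: {len(rejected)} claims rejected")

    # Phase 4, Verdict: placeholder aggregation of surviving support.
    if not h.evidence:
        return 0.0
    surviving = sum(e.weight for i, e in enumerate(h.evidence) if i not in rejected)
    score = surviving / len(h.evidence)
    h.log.append(f"verdict: score={score:.2f}")
    return score


def stub_investigate(h: Hypothesis) -> List[Evidence]:
    # Stand-ins for the StyleComparator and ConceptRetriever outputs.
    return [
        Evidence("StyleComparator", "similar handling of contour", 0.8),
        Evidence("ConceptRetriever", "shared iconographic motif", 0.6),
    ]


def stub_critic(claims: List[str]) -> Set[int]:
    return {1}  # pretend the critic refutes the second claim


demo = Hypothesis("Artist A", "Artist B", 1600, 1650)
demo_score = run_protocol(demo, stub_investigate, [temporal_axiom], stub_critic)
print(demo_score)       # 0.4
print(*demo.log, sep="\n")
```

Dividing by the number of gathered claims, rather than the number of surviving ones, means each refuted claim lowers the score; that is one possible design choice among several and is not taken from the paper.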

If this is right

Attributions become traceable to specific image features, biographical facts, and axiomatic checks rather than opaque similarity scores.
Performance stays high when obvious influence language is removed, indicating the method relies on deeper visual and contextual reasoning.
The same controller and operators can in principle be applied to other attribution tasks that require domain rules and falsification steps.
Benchmarks built around directed pairs and leakage controls provide a clearer testbed for evaluating evidence-based agents in cultural domains.
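
A phrase-masking leakage control of the kind mentioned above might look like the sketch below. The phrase list, the mask token, and the function name are assumptions made for illustration; the source does not enumerate which phrases were masked or how.

```python
import re

# Hypothetical phrase list; the paper's actual list is not given in the source.
INFLUENCE_PHRASES = [
    "influenced by",
    "inspired by",
    "under the influence of",
    "studied under",
    "a follower of",
]

# One case-insensitive pattern that matches any listed phrase on word boundaries.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in INFLUENCE_PHRASES) + r")\b",
    re.IGNORECASE,
)


def mask_influence_phrases(text: str, token: str = "[MASK]"):
    """Replace explicit influence phrases and report how many were masked."""
    masked, n_hits = _PATTERN.subn(token, text)
    return masked, n_hits


biography = "Artist B was influenced by Artist A and later studied under Artist C."
masked_bio, hits = mask_influence_phrases(biography)
print(masked_bio)  # Artist B was [MASK] Artist A and later [MASK] Artist C.
print(hits)        # 2
```

Note that this simple variant leaves the neighbouring artist name in place, so a stricter control would mask or drop the whole sentence.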

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested on influence relations across other visual media such as photography or film to check whether the same evidence protocol transfers.
Incorporating newly digitized archival documents as additional input sources might further strengthen the corroboration and falsification phases.
If the critic component is made more independent, the overall system might serve as a template for AI tools in fields like legal precedent analysis or scientific claim verification where falsification is essential.

Load-bearing premise

The protocol with its isolated critic and art-historical axioms produces attributions that align more closely with historical validity than embedding similarity or unguided model output, and the WIB-100 labels form an unbiased ground truth for implicit influence.

What would settle it

Direct comparison of the agent's attributions on a new set of artist pairs against independent judgments by multiple art historians who have no access to the agent's evidence chains or the original benchmark labels.

Figures

Figures reproduced from arXiv: 2604.07468 by Hanyi Liu, Heran Yang, Minghao Wang, Yuhang Xie, Zhonghao Jiu.

**Figure 2.** [image: figures/full_fig_p005_2.png]

**Figure 3.** Table VII shows that M-ArtAgent achieves the strongest overall performance among all compared systems. Over five folds, it reaches 83.2 ± 1.1% MacroF1 and 0.666 ± 0.021 MCC while preserving both high recall (86.0 ± 1.4%) and high specificity (80.5 ± 1.8%). GalleryGPT remains the strongest overall baseline, but it trails by 24.5 points in specificity and by 0.272 MCC. Among the newly added KG comparators,…

**Figure 3.** [image: figures/full_fig_p011_3.png]

**Figure 4.** [image: figures/full_fig_p013_4.png]

**Figure 5.** [image: figures/full_fig_p013_5.png]

Original abstract

Implicit artistic influence, although visually plausible, is often undocumented and thus poses a historically constrained attribution problem: resemblance is necessary but not sufficient evidence. Most prior systems reduce influence discovery to embedding similarity or label-driven graph completion, while recent multimodal large language models (LLMs) remain vulnerable to temporal inconsistency and unverified attributions. This paper introduces M-ArtAgent, an evidence-based multimodal agent that reframes implicit influence discovery as probabilistic adjudication. It follows a four-phase protocol consisting of Investigation, Corroboration, Falsification, and Verdict governed by a Reasoning and Acting (ReAct)-style controller that assembles verifiable evidence chains from images and biographies, enforces art-historical axioms, and subjects each hypothesis to adversarial falsification via a prompt-isolated critic. Two theory-grounded operators, StyleComparator for Wölfflin formal analysis and ConceptRetriever for ICONCLASS-based iconographic grounding, ensure that intermediate claims are formally auditable. On the balanced WikiArt Influence Benchmark-100 (WIB-100) of 100 artists and 2,000 directed pairs, M-ArtAgent achieves 83.7% positive-class F1, 0.666 Matthews correlation coefficient (MCC), and 0.910 area under the receiver operating characteristic curve (ROC-AUC), with leakage-control and robustness checks confirming that the gains persist when explicit influence phrases are masked. By coupling multimodal perception with domain-constrained falsification, M-ArtAgent demonstrates that implicit influence analysis benefits from historically grounded adjudication rather than pattern matching alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

M-ArtAgent adds a four-phase ReAct protocol with style and iconography operators to multimodal LLMs for implicit art influence, but the WIB-100 labels may not be independent enough to prove the method beats retrieval.

The paper's core move is to treat implicit influence as a probabilistic adjudication task rather than similarity search. It runs a ReAct controller through Investigation, Corroboration, Falsification, and Verdict, with a prompt-isolated critic and two operators: StyleComparator for Wölfflin-style formal checks and ConceptRetriever for ICONCLASS grounding. That combination is not just another LLM prompt; it tries to enforce auditable evidence chains and art-historical constraints at each step.

On the balanced WIB-100 benchmark of 100 artists and 2000 directed pairs, it reports 83.7% positive F1, 0.666 MCC, and 0.910 AUC, with checks that the gains hold after masking explicit phrases. Those numbers and the leakage controls are the concrete results worth noting. The protocol itself is new for this narrow task and gives a clearer structure than plain embedding or unguided generation. The falsification phase in particular is a reasonable attempt to push back against hallucinated attributions.

The main weakness is the ground truth. The abstract and stress-test note both flag that the 2000 pairs come from WikiArt-derived sources; if those overlap with the model's pretraining data, the performance could reflect recovered associations rather than genuine adjudication of undocumented influence. Masking phrases does not fully close that gap, and the paper does not appear to provide inter-annotator agreement or primary-source-only labeling to break the circularity. No error bars or full ablation isolating the critic are mentioned in the summary, which leaves the contribution of each phase unclear.

This work is for people already working on AI tools for art history or digital humanities who want a more constrained agent setup. A reader focused on agentic workflows or cultural attribution tasks would find the protocol design useful to examine. It deserves peer review because the structured falsification idea is worth testing against stronger baselines and cleaner labels, even if the current evaluation needs tightening on the data side. I would send it to referees with a request to address the benchmark independence question directly.

Referee Report

4 major / 1 minor

Summary. The paper introduces M-ArtAgent, a multimodal agent that uses a four-phase ReAct-style protocol (Investigation, Corroboration, Falsification, Verdict) with art-historical axioms, a prompt-isolated critic, StyleComparator for formal analysis, and ConceptRetriever for iconographic grounding to perform evidence-based discovery of implicit artistic influences. It claims this approach yields historically valid attributions superior to embedding similarity or unguided LLMs, supported by 83.7% positive-class F1, 0.666 MCC, and 0.910 ROC-AUC on the balanced WIB-100 benchmark (100 artists, 2000 directed pairs), with robustness shown under explicit-phrase masking.

Significance. If the WIB-100 labels constitute independent ground truth and the protocol components are validated through ablations, the work would offer a structured, falsifiable framework for multimodal agents in art history that prioritizes verifiable evidence chains over pattern matching. This could influence agent design in cultural heritage domains by demonstrating the utility of domain axioms and adversarial falsification. The explicit operators and leakage controls are strengths that provide a reproducible template, though the absence of statistical details and component ablations currently limits the strength of the performance claims.

major comments (4)

[Abstract] The reported metrics (83.7% F1, 0.666 MCC, 0.910 AUC) are presented without error bars, confidence intervals, number of runs, or statistical significance tests, making it impossible to determine whether the gains over baselines are robust or could arise from variance in the 2000-pair evaluation.
[Abstract] No description is given of how the 2000 directed pairs in WIB-100 were labeled (e.g., source of annotations, inter-annotator agreement, or restriction to primary sources), which is load-bearing for the central claim that the protocol produces historically valid attributions rather than retrieving associations from training data.
[Abstract] The manuscript provides no ablation isolating the contribution of the Falsification phase or the prompt-isolated critic, both of which are presented as essential to the evidence-based adjudication; without this, the superiority over unguided LLM output cannot be attributed to the claimed protocol elements.
[Abstract] The leakage-control experiment masks explicit influence phrases but does not address potential broader overlap between WIB-100 labels (sourced from art-historical texts) and LLM pretraining corpora, leaving open the possibility that performance reflects retrieval rather than adjudication of undocumented implicit influences.
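
The interval requested in the first major comment could be obtained with a percentile bootstrap over evaluation pairs. The sketch below is one possible procedure, not something the paper reports. The labels and predictions are placeholders constructed to reproduce the 86.0% recall and 80.5% specificity quoted in the Table VII excerpt on an assumed 1,000 / 1,000 class split; real per-pair predictions would be needed for a meaningful interval.

```python
import random


def positive_f1(y_true, y_pred):
    """Positive-class F1 from parallel lists of 0/1 labels and predictions."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap: resample pairs with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Placeholder data (an assumption, not the benchmark's real predictions):
# 860 TP, 140 FN, 805 TN, 195 FP.
y_true = [1] * 1000 + [0] * 1000
y_pred = [1] * 860 + [0] * 140 + [0] * 805 + [1] * 195

point = positive_f1(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, positive_f1)
print(f"F1 = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Because the 2000 directed pairs share only 100 artists, pairs are not independent; resampling at the artist level would probably give a more honest (wider) interval than the pair-level resampling shown here.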

minor comments (1)

[Abstract] The abstract would benefit from a brief statement of the number of artists and pairs in WIB-100 earlier in the performance sentence for immediate clarity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each of the major comments point by point below, and indicate where revisions will be made to the manuscript.

Referee: [Abstract] The reported metrics (83.7% F1, 0.666 MCC, 0.910 AUC) are presented without error bars, confidence intervals, number of runs, or statistical significance tests, making it impossible to determine whether the gains over baselines are robust or could arise from variance in the 2000-pair evaluation.

Authors: We agree that providing statistical details is essential for assessing the robustness of our results. In the revised manuscript, we will report results from multiple independent runs (specifying the number, e.g., 5), include error bars and 95% confidence intervals for the metrics, and conduct appropriate statistical tests to compare against baselines. This will allow readers to evaluate the significance of the observed improvements.

revision: yes

Referee: [Abstract] No description is given of how the 2000 directed pairs in WIB-100 were labeled (e.g., source of annotations, inter-annotator agreement, or restriction to primary sources), which is load-bearing for the central claim that the protocol produces historically valid attributions rather than retrieving associations from training data.

Authors: The WIB-100 benchmark draws its influence labels from established art-historical sources associated with the WikiArt dataset. We acknowledge that the current manuscript lacks a detailed account of the labeling process. In the revision, we will include an expanded description of the benchmark that specifies the sources of the annotations (art history references), gives any available details on annotation methodology, and discusses the extent to which labels are based on documented rather than inferred influences. We will also note limitations regarding inter-annotator agreement if comprehensive statistics are not available.

revision: yes

Referee: [Abstract] The manuscript provides no ablation isolating the contribution of the Falsification phase or the prompt-isolated critic, both of which are presented as essential to the evidence-based adjudication; without this, the superiority over unguided LLM output cannot be attributed to the claimed protocol elements.

Authors: We recognize the value of component ablations to validate the contributions of the Falsification phase and the prompt-isolated critic. Although the manuscript includes overall performance and some robustness checks, dedicated ablations for these elements were not reported. We will add these ablations in the revised version, comparing performance with and without the Falsification phase and the critic, to better attribute the gains to the specific protocol components.

revision: yes

Referee: [Abstract] The leakage-control experiment masks explicit influence phrases but does not address potential broader overlap between WIB-100 labels (sourced from art-historical texts) and LLM pretraining corpora, leaving open the possibility that performance reflects retrieval rather than adjudication of undocumented implicit influences.

Authors: The leakage-control experiment was designed to test reliance on explicit phrases by masking them. We agree that it does not fully rule out retrieval from pretraining data for more implicit associations. In the revision, we will elaborate on this limitation in the discussion section and strengthen the argument by emphasizing how the falsification phase and evidence chain requirements mitigate pure retrieval. If possible, we will consider additional experiments, such as evaluating on influences documented after the model's training cutoff, though this may be constrained by data availability.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical metrics on external benchmark

The paper presents an agent architecture evaluated via direct empirical measurements (F1, MCC, AUC) on the WIB-100 benchmark of 2000 directed pairs, with leakage controls that mask explicit phrases. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the provided text. The four-phase ReAct protocol and operators (StyleComparator, ConceptRetriever) are described as design choices, not quantities derived from the reported performance numbers. The central claim reduces to measured accuracy against held-out labels rather than any construction that equates outputs to inputs by definition. This is a standard non-circular empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence and enforceability of unspecified art-historical axioms plus the assumption that the WIB-100 labels are reliable ground truth for implicit influence.

axioms (1)

domain assumption: Art-historical axioms can be enforced inside the agent loop to constrain attributions.
The abstract states that the controller 'enforces art-historical axioms' without listing them or showing how enforcement is implemented.

pith-pipeline@v0.9.0 · 5586 in / 1430 out tokens · 29781 ms · 2026-05-10T18:19:05.804480+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean, Cost/FunctionalEquation.lean · reality_from_one_distinction, washburn_uniqueness_aczel · unclear

M-ArtAgent follows a four-phase protocol consisting of Investigation, Corroboration, Falsification, and Verdict governed by a ReAct-style controller... Two theory-grounded operators, StyleComparator for Wölfflin formal analysis and ConceptRetriever for ICONCLASS-based iconographic grounding

IndisputableMonolith/Foundation/DimensionForcing.lean, AlexanderDuality.lean · alexander_duality_circle_linking, D3_admits_circle_linking · unclear

Wölfflinian formal manifold W = span(B) ... orthogonal projection P_W : R^d → R^5, w(I) = B^T z(I)
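
The projection in the passage above can be read as taking coordinates of an image embedding z(I) in a five-dimensional subspace. The toy sketch below illustrates that reading only: it assumes the columns of B are orthonormal, and the basis, the embedding values, and the dimension d = 6 are made up for the example. Mapping the five coordinates onto Wölfflin's five pairs of principles is an inference from the name, not something the quoted passage states.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def is_orthonormal(basis: Sequence[Vector], tol: float = 1e-9) -> bool:
    """Check that basis vectors have unit length and are mutually orthogonal."""
    for i, bi in enumerate(basis):
        for j, bj in enumerate(basis):
            target = 1.0 if i == j else 0.0
            if abs(dot(bi, bj) - target) > tol:
                return False
    return True


def project(basis: Sequence[Vector], z: Vector) -> List[float]:
    """Compute w = B^T z, where each entry of `basis` is one column of B."""
    if not is_orthonormal(basis):
        raise ValueError("basis must be orthonormal for w = B^T z to be a projection")
    return [dot(b, z) for b in basis]


# Toy setup (hypothetical): d = 6, and B spans the first five coordinate axes.
d = 6
B = [[1.0 if k == i else 0.0 for k in range(d)] for i in range(5)]
z = [0.2, -0.1, 0.4, 0.0, 0.3, 0.9]  # made-up embedding

w = project(B, z)
print(w)  # [0.2, -0.1, 0.4, 0.0, 0.3]
```

The sixth component of z is discarded because it lies outside the span of B, which is the sense in which the map is an orthogonal projection onto W.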

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Data science and digital art history,

L. Manovich, “Data science and digital art history,” Int. J. Digital Art History, no. 1, pp. 10–35, 2015

work page 2015

[2] [2]

The shape of art history in the eyes of the machine,

A. Elgammal, B. Liu, D. Kim, M. Elhoseiny, and M. Mazzone, “The shape of art history in the eyes of the machine,” in Proc. 32nd AAAI Conf. Artif. Intell., 2018

work page 2018

[3] [3]

GalleryGPT: Analyzing paintings with large multimodal models,

Y. Bin, W. Shi, Y. Ding, Z. Hu, Z. Wang, Y. Yang, S.-K. Ng, and H. T. Shen, “GalleryGPT: Analyzing paintings with large multimodal models,” in Proc. 32nd ACM Int. Conf. Multimedia (MM), 2024, pp. 7734–7743

work page 2024

[4] [4]

Diffusion based augmentation for captioning and retrieval in cultural heritage,

D. Cioni, L. Berlincioni, F. Becattini, and A. Del Bimbo, “Diffusion based augmentation for captioning and retrieval in cultural heritage,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), 2023

work page 2023

[5] [5]

Caption generation in cultural heritage: Crowdsourced data and tuning multimodal large language models,

A. Reshetnikov and M.-C. Marinescu, “Caption generation in cultural heritage: Crowdsourced data and tuning multimodal large language models,” in Proc. 1st Workshop Lang. Models Underserved Communities (LM4UC), 2025, pp. 42–50

work page 2025

[6] L. Rei, D. Mladenić, M. Dorozynski, F. Rottensteiner, T. Schleider, R. Troncy, J. S. Lozano, and M. G. Salvatella, “Multimodal metadata assignment for cultural heritage artifacts,” Multimedia Systems, vol. 29, pp. 847–869, 2023.

[7] J. Yuan, J. Zhang, F. Wu, D. Lu, H. Lu, and Q. Wang, “Towards cross-modal retrieval in Chinese cultural heritage documents: Dataset and solution,” in Proc. Int. Conf. Document Anal. Recognit. (ICDAR). Springer, 2025, pp. 570–586.

[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763.

[9] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proc. Int. Conf. Mach. Learn. (ICML), 2023, pp. 19730–19742.

[10] M. Lymperaiou and G. Stamou, “A survey on knowledge-enhanced multimodal learning,” Artif. Intell. Rev., vol. 57, p. 284, 2024.

[11] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,” Front. Comput. Sci., vol. 18, no. 6, p. 186345, 2024.

[12] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in Proc. Int. Conf. Learn. Representations (ICLR), 2023.

[13] P. Jiang, J. Lin, Z. Shi, Z. Wang, L. He, Y. Wu, M. Zhong, P. Song, Q. Zhang et al., “Adaptation of agentic AI,” arXiv preprint arXiv:2512.16301, 2025.

[14] S. Nayak, K. Jain, R. Awal, S. Reddy, S. V. Steenkiste, L. A. Hendricks, K. Stanczak, and A. Agrawal, “Benchmarking vision language models for cultural understanding,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2024, pp. 5769–5790.

[15] J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge Univ. Press, 2009.

[16] B. Saleh, K. Abe, R. S. Arora, and A. Elgammal, “Toward automated discovery of artistic influence,” Multimedia Tools Appl., vol. 75, no. 7, pp. 3565–3591, 2016.

[17] A. Elgammal and B. Saleh, “Quantifying creativity in art networks,” in Proc. 6th Int. Conf. Comput. Creativity (ICCC), 2015, pp. 39–46.

[18] A. Ghildyal, L.-Y. Wang, and F. Liu, “WP-CLIP: Leveraging CLIP to predict Wölfflin’s principles in visual art,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), 2025, pp. 396–405.

[19] D. Ruta, A. Gilbert, P. Aggarwal, N. Marri, A. Kale, J. Briggs, C. Speed, H. Jin, B. Faieta, A. Filipkowski, Z. Lin, and J. Collomosse, “StyleBabel: Artistic style tagging and captioning,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 219–236.

[20] G. Castellano, V. Digeno, G. Sansaro, and G. Vessio, “Leveraging knowledge graphs and deep learning for automatic art analysis,” Knowl.-Based Syst., vol. 248, p. 108859, 2022.

[21] C. B. El Vaigh, N. Garcia, B. Renoust, C. Chu, Y. Nakashima, Y. Qian, and H. Nagahara, “GNNBoost: Boosting artwork classification with graph embeddings,” Multimedia Tools Appl., vol. 84, pp. 39353–39373, 2025.

[22] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Mach. Intell., vol. 1, no. 5, pp. 206–215, 2019.

[23] H. Wölfflin, Principles of Art History: The Problem of the Development of Style in Later Art. Dover, 1950.

[24] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Representations (ICLR), 2021.

[25] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese BERT-networks,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2019, pp. 3982–3992.

[26] J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, 2021.

[27] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020.

[28] D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, p. 6, 2020.

[29] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 26, 2013.

[30] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, “Complex embeddings for simple link prediction,” in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 2071–2080.

[31] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, 2022, pp. 22199–22213.

[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, 2022, pp. 24824–24837.

[33] M. V. Conde and K. Turgutlu, “CLIP-Art: Contrastive pre-training for fine-grained art classification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2021, pp. 3956–3960.

[34] R. Li, M. Moh, and T.-S. Moh, “Siamese neural networks for content-based visual art recommendation,” in Proc. 17th Int. Conf. Ubiquitous Inf. Manage. Commun. (IMCOM), 2023.

[35] Z. Shi, J. Kim, W. Li, Y. Li, and H. Pfister, “MoRA: LoRA guided multi-modal disease diagnosis with missing modality,” in Proc. Med. Image Comput. Comput. Assisted Intervention (MICCAI), 2024, pp. 273–282.

[36] C. Si, Z. Shi, S. Zhang, X. Yang, H. Pfister, and W. Shen, “Task-specific directions: Definition, exploration, and utilization in parameter efficient fine-tuning,” IEEE Trans. Pattern Anal. Mach. Intell., 2026.

[37] C. Si, Z. Shi, X. Wang, Y. Xiao, X. Yang, and W. Shen, “Generalized tensor-based parameter-efficient fine-tuning via Lie group transformations,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2025.

[38] Z. Shi, B. Wang, C. Si, Y. Wu, J. Kim, and H. Pfister, “DualEdit: Dual editing for knowledge updating in vision-language models,” in Proc. Conf. Lang. Model. (COLM), 2025.

Hanyi Liu received the B.S. degree from Southeast University, Nanjing, China, and the M.A. degree from the Royal College of Art, London, U.K. She is currently a researcher ...