A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
Pith reviewed 2026-05-10 02:09 UTC · model grok-4.3
The pith
Agent-generated reasoning plans guide targeted retrieval to produce more grounded explanations of artworks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. This produces higher-quality final explanations on SemArt and Artpedia while demonstrating clear advantages in evidence grounding and multi-step reasoning on the new ArtCoT-QA benchmark.
What carries the argument
Structured reasoning plans that specify per-step goals and evidence needs, used to condition the multimodal retrieval process.
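The conditioning step can be made concrete with a small sketch: each plan step carries its own evidence query, and retrieval is issued per step rather than once for the whole task. All names here (`PlanStep`, `plan_conditioned_retrieval`, the toy corpus) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    goal: str            # what this step should establish
    evidence_query: str  # what evidence to retrieve for it

def plan_conditioned_retrieval(plan, retriever, top_k=3):
    """Retrieve evidence separately for each step of a reasoning plan,
    rather than issuing one static query for the whole task."""
    evidence = {}
    for i, step in enumerate(plan):
        # Each step's evidence query drives its own retrieval call,
        # so the retrieved passages are targeted to that step's goal.
        evidence[i] = retriever(step.evidence_query, top_k=top_k)
    return evidence

def toy_retriever(query, top_k=3, corpus=None):
    """Toy stand-in for a real index: ranks documents by word overlap."""
    corpus = corpus or [
        "Caravaggio pioneered dramatic chiaroscuro lighting.",
        "The Baroque period favoured dynamic composition.",
        "Tenebrism uses violent contrasts of light and dark.",
    ]
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:top_k]

plan = [
    PlanStep("identify the lighting technique", "chiaroscuro lighting technique"),
    PlanStep("situate the style historically", "Baroque period composition"),
]
evidence = plan_conditioned_retrieval(plan, toy_retriever, top_k=1)
```

The design point is that a weak or wrong plan step propagates directly into retrieval, which is exactly the load-bearing premise noted below.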
If this is right
- A-MAR produces higher-quality final explanations than static retrieval or standard multimodal baselines on SemArt and Artpedia.
- It improves evidence grounding and multi-step reasoning ability when measured on ArtCoT-QA.
- The method supports interpretable, goal-driven retrieval rather than relying solely on internalized model knowledge.
- It is positioned as relevant for knowledge-intensive multimodal tasks in cultural domains.
Where Pith is reading between the lines
- The explicit plans could make it easier to audit or correct AI reasoning in other knowledge-rich domains such as history or material culture.
- If plan quality can be measured independently, the framework might support iterative improvement where weak steps are regenerated before retrieval.
- The same conditioning idea could be tested on non-art visual domains that require external context, such as scientific diagrams or archaeological finds.
Load-bearing premise
The agent can reliably generate structured reasoning plans that correctly identify needed evidence without introducing errors that degrade retrieval quality.
What would settle it
A controlled test on ArtCoT-QA in which retrieval uses the same model but ignores or replaces the generated reasoning plans, and the resulting explanations show equal or better grounding and multi-step accuracy than the full A-MAR system.
read the original abstract
Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditionedon this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes A-MAR, an agent-based multimodal art retrieval framework that first decomposes an artwork query into a structured reasoning plan specifying goals and evidence needs per step, then conditions retrieval on this plan to produce step-wise grounded explanations. It introduces the ArtCoT-QA diagnostic benchmark for multi-step art reasoning and reports that A-MAR outperforms static non-planned retrieval and strong MLLM baselines on SemArt and Artpedia for explanation quality, as well as on ArtCoT-QA for evidence grounding and reasoning ability. Code and data are released.
Significance. If the results hold after addressing validation gaps, the work advances interpretable multimodal reasoning by making retrieval explicitly goal-driven rather than implicit, with clear relevance to cultural heritage and fine-grained visual understanding tasks. The public release of code and the ArtCoT-QA benchmark is a concrete strength that supports reproducibility and further research.
major comments (2)
- [§3.2] Reasoning Plan Generation: The central claim that conditioning retrieval on agent-generated structured plans produces superior targeted evidence requires direct validation of plan quality. No quantitative metrics are reported on plan accuracy (e.g., the fraction of plans with incorrect evidence requirements or logical gaps on ArtCoT-QA examples), leaving open whether the outperformance arises from the planning mechanism itself.
- [§4.3] Ablation Studies: The manuscript lacks an oracle-plan ablation that replaces agent-generated plans with gold-standard plans while holding retrieval and generation steps fixed. Without this, it is impossible to isolate the causal role of the reasoning plan in the reported gains on explanation quality and grounding, as improvements could stem from extra LLM calls or retrieval volume.
minor comments (2)
- [Abstract] 'conditionedon' is missing a space and should read 'conditioned on'.
- [§4] Experiments: The abstract and results sections claim 'consistent outperformance' without referencing specific quantitative deltas, tables, or statistical significance tests in the main text, which reduces clarity on the magnitude of improvements.
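Where significance tests are missing, a paired bootstrap over per-example scores is a standard remedy: resample matched score pairs and measure how often the mean difference still favours the proposed system. A minimal sketch (all scores below are hypothetical, not taken from the paper):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Paired bootstrap for the claim 'system A outperforms system B':
    resample per-example score differences with replacement and count
    how often the resampled mean fails to favour A.
    Returns the observed mean delta and a one-sided p-value estimate."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(deltas) / n
    worse = 0
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            worse += 1
    return observed, worse / n_resamples

# Hypothetical per-example grounding scores: A-MAR vs a static-retrieval baseline.
amar = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79, 0.85, 0.73]
static = [0.70, 0.72, 0.80, 0.66, 0.75, 0.74, 0.77, 0.71]
delta, p = paired_bootstrap(amar, static)
```

Reporting the delta alongside such a p-value would address the magnitude concern without changing the experimental setup.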
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our approach and indicating revisions to strengthen the validation of the reasoning plan mechanism.
read point-by-point responses
- Referee: [§3.2] Reasoning Plan Generation: The central claim that conditioning retrieval on agent-generated structured plans produces superior targeted evidence requires direct validation of plan quality. No quantitative metrics are reported on plan accuracy (e.g., the fraction of plans with incorrect evidence requirements or logical gaps on ArtCoT-QA examples), leaving open whether the outperformance arises from the planning mechanism itself.
Authors: We agree that direct quantitative metrics on plan quality would provide stronger support for the central claim. While ArtCoT-QA's focus on evidence grounding and multi-step reasoning offers indirect validation through end-to-end performance, it does not explicitly measure plan correctness. In the revised manuscript, we will add a new analysis in §3.2 and §4 reporting plan accuracy on a manually annotated sample of ArtCoT-QA examples, including the fraction of plans with correct evidence requirements and logical structure, along with common error categories. This will help confirm the reliability of the planning step. revision: yes
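The promised plan-accuracy analysis could be tallied along these lines; the annotation schema (`evidence_ok`, `logic_ok`, the error-category labels) is a hypothetical illustration of the kind of manual labels the revision would need, not the authors' actual protocol.

```python
def plan_accuracy(annotations):
    """Summarize manual plan annotations: each entry records whether a plan's
    evidence requirements and logical structure were judged correct, plus an
    optional error category for failed plans."""
    n = len(annotations)
    correct = sum(1 for a in annotations if a["evidence_ok"] and a["logic_ok"])
    errors = {}
    for a in annotations:
        if a.get("error"):
            # Tally common failure modes (e.g., missing step, wrong evidence).
            errors[a["error"]] = errors.get(a["error"], 0) + 1
    return {"n": n, "plan_accuracy": correct / n, "error_counts": errors}

# Illustrative annotated sample of four plans.
sample = [
    {"evidence_ok": True, "logic_ok": True},
    {"evidence_ok": True, "logic_ok": False, "error": "missing_step"},
    {"evidence_ok": False, "logic_ok": True, "error": "wrong_evidence"},
    {"evidence_ok": True, "logic_ok": True},
]
report = plan_accuracy(sample)
```

Even a small annotated sample scored this way would show whether end-to-end gains track plan correctness.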
- Referee: [§4.3] Ablation Studies: The manuscript lacks an oracle-plan ablation that replaces agent-generated plans with gold-standard plans while holding retrieval and generation steps fixed. Without this, it is impossible to isolate the causal role of the reasoning plan in the reported gains on explanation quality and grounding, as improvements could stem from extra LLM calls or retrieval volume.
Authors: We concur that an oracle-plan ablation is the most direct way to isolate the causal effect of the planning component. In the revised §4.3, we will add this ablation by adapting the gold multi-step reasoning chains from ArtCoT-QA into our structured plan format to serve as oracles. Performance will be compared against agent-generated plans while fixing the retrieval and generation modules and matching retrieval volume across conditions. We will also report on any differences in computational overhead to address concerns about extra LLM calls. revision: yes
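The ablation described in the response reduces to varying only the plan source while pinning everything else. A minimal harness, with all example data and function names hypothetical:

```python
def run_ablation(examples, plan_source, retriever, top_k=3):
    """One arm of the oracle-plan ablation: retrieval and generation are held
    fixed; only the plan source (agent-generated vs gold) varies, and each
    condition issues one retrieval call per plan step at the same top_k,
    matching retrieval volume across conditions."""
    results = []
    for ex in examples:
        plan = plan_source(ex)
        evidence = [retriever(step, top_k=top_k) for step in plan]
        results.append({"id": ex["id"], "n_calls": len(evidence), "evidence": evidence})
    return results

# Toy stand-ins: two plan sources for the same example, one fixed retriever.
examples = [{"id": 1, "query": "Why is the lighting so dramatic?"}]
agent_plans = {1: ["lighting technique in the painting", "artist's stylistic influences"]}
gold_plans = {1: ["definition of chiaroscuro", "Caravaggio's influence on the artist"]}

def retriever(query, top_k=3):
    # Placeholder for a real index lookup.
    return [f"passage about {query}"][:top_k]

agent = run_ablation(examples, lambda ex: agent_plans[ex["id"]], retriever)
oracle = run_ablation(examples, lambda ex: gold_plans[ex["id"]], retriever)
```

Comparing downstream grounding scores between the two arms, with `n_calls` equal by construction, isolates the plan's contribution from retrieval volume.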
Circularity Check
No circularity: the A-MAR mechanism and its evaluations are defined independently of each other
full rationale
The paper presents A-MAR as a novel agent-based framework that first generates a structured reasoning plan from artwork and query, then conditions multimodal retrieval on that plan to produce grounded explanations. This is evaluated empirically on SemArt, Artpedia, and the newly introduced ArtCoT-QA benchmark using standard metrics for explanation quality, evidence grounding, and multi-step reasoning. No equations, fitted parameters, or self-citations are used to define the core mechanism; the claimed superiority is asserted via experimental comparisons to static retrieval and MLLM baselines rather than by construction or renaming. The derivation chain therefore remains self-contained and externally falsifiable.