Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Pith reviewed 2026-05-10 17:45 UTC · model grok-4.3
The pith
Vision-language models capture only fragmented signals when inferring structured cultural metadata from images, and their accuracy varies widely across cultural regions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents Appear2Meaning, a cross-cultural benchmark for structured cultural metadata inference from images. Using an LLM-as-Judge framework to measure semantic alignment with reference annotations, it reports exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
What carries the argument
The Appear2Meaning benchmark, which supplies images tagged by cultural region and requires models to infer multiple structured metadata attributes, scored via LLM-based semantic alignment against expert references.
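To make the three reported metrics concrete, here is a minimal sketch of how they could be computed from per-attribute judge verdicts. The paper's exact definitions are not given on this page, so the definitions below are assumptions: exact match means every attribute was judged aligned, partial match means at least one was, and attribute-level accuracy is the per-attribute hit rate. The names `JudgedItem` and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JudgedItem:
    """One image with the judge's verdict per metadata attribute (hypothetical schema)."""
    region: str
    verdicts: Dict[str, bool]  # attribute name -> judged semantically aligned?


def summarize(items: List[JudgedItem]) -> dict:
    """Compute exact-match, partial-match, and attribute-level accuracy.

    Assumed definitions (not confirmed by the paper):
      exact match   = all attributes of an item judged aligned
      partial match = at least one attribute judged aligned
    """
    if not items:
        raise ValueError("need at least one judged item")
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    exact = 0
    partial = 0
    for item in items:
        values = list(item.verdicts.values())
        exact += int(bool(values) and all(values))
        partial += int(any(values))
        for attr, ok in item.verdicts.items():
            totals[attr] = totals.get(attr, 0) + 1
            hits[attr] = hits.get(attr, 0) + int(ok)
    n = len(items)
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {a: hits[a] / totals[a] for a in totals},
    }
```

Under these assumed definitions, a model that often gets the origin right but rarely the creator would show a high partial-match rate next to a low exact-match rate, which is one way the "fragmented signals" pattern could surface numerically.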
If this is right
- Models must move beyond isolated visual cues to build integrated cultural reasoning if they are to handle structured metadata tasks consistently.
- Large gaps in accuracy by region and attribute type indicate that training data and architectures need targeted adjustments for underrepresented cultures.
- Automated heritage metadata systems built on current VLMs will produce unreliable results until the observed inconsistencies are addressed.
- Attribute-level breakdowns can serve as diagnostic tools to identify which metadata categories are most in need of model improvement.
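The diagnostic use of breakdowns mentioned above can be sketched as a simple group-by over judged predictions. This is an illustration only: the record keys (`region`, `attribute`, `correct`) and both function names are hypothetical, and the paper's own analysis may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(records: Iterable[dict], key: str) -> Dict[str, float]:
    """Accuracy per group, where `key` is e.g. "region" or "attribute".

    Each record is assumed to be a dict with the grouping key and a boolean
    "correct" field holding the judge's verdict for one predicted attribute.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def largest_gap(scores: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (best group, worst group, accuracy difference between them)."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst, scores[best] - scores[worst]
```

Calling `accuracy_by_group(records, "attribute")` and then `largest_gap` on the result points at the metadata category with the weakest performance; the same call with `"region"` does so for cultural regions.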
Where Pith is reading between the lines
- The benchmark format could be reused to test whether similar fragmentation appears in other multimodal tasks that require historical or contextual knowledge.
- Persistent regional differences may reflect imbalances in the visual and textual data used to train existing vision-language models.
- Pairing the automated judge with periodic human expert checks could reduce the risk that evaluation itself embeds cultural skew.
Load-bearing premise
The LLM-as-Judge framework reliably measures semantic alignment with reference annotations without introducing its own cultural or interpretive biases.
What would settle it
A follow-up study in which human cultural experts directly rate the same model outputs and produce accuracy scores that match or diverge sharply from the LLM judge results across the full set of cultures and metadata types.
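One way such a comparison could be quantified is chance-corrected agreement between expert ratings and judge verdicts. The sketch below uses Cohen's kappa, which is a choice made here for illustration and not something the paper reports; the function name and input format are hypothetical.

```python
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between two raters who labeled the same outputs.

    `human` and `judge` are parallel sequences of labels (for example
    True/False for "aligned with the reference").
    """
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of outputs where both raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    labels = set(human) | set(judge)
    expected = sum(
        (list(human).count(label) / n) * (list(judge).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Computing kappa separately per cultural region and per metadata type would show whether the judge tracks experts evenly or drifts for particular regions, which is the specific risk the load-bearing premise raises.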
Original abstract
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Appear2Meaning, a multi-category cross-cultural benchmark for inferring structured cultural metadata (e.g., creator, origin, period) from images. It evaluates VLMs via an LLM-as-Judge framework that scores semantic alignment with reference annotations, reporting exact-match, partial-match, and attribute-level accuracy across cultural regions and metadata types. The central claim is that current VLMs capture only fragmented signals, exhibit substantial performance variation, and produce inconsistent, weakly grounded predictions.
Significance. If the evaluation is robust, the benchmark would be a useful contribution to computer vision and cultural AI by providing a structured testbed for cultural reasoning capabilities beyond standard captioning. The cross-cultural design and multi-level accuracy metrics could help surface limitations in VLMs for heritage applications.
major comments (2)
- [LLM-as-Judge framework and Experiments] The headline claims about fragmented signals and cultural performance gaps rest entirely on LLM-as-Judge semantic alignment scores. No calibration data, human inter-annotator agreement, or judge-model ablation is reported, leaving open the possibility that the judge's own training distribution introduces systematic cultural bias that inflates the observed gaps (see the LLM-as-Judge framework description and the Experiments section).
- [Abstract and Results] The abstract states that results show 'substantial performance variation' but supplies neither dataset size, the list of evaluated VLMs, nor any numerical accuracy values. Without these, the magnitude and statistical reliability of the cross-cultural claims cannot be assessed (see Results section and any accompanying tables).
minor comments (1)
- [Abstract] The abstract would benefit from one or two concrete accuracy numbers or dataset scale figures to give readers an immediate sense of the effect sizes.
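The referee's concern about magnitude and statistical reliability could be addressed by attaching an uncertainty interval to each cross-region gap. The following is a minimal sketch of a percentile bootstrap for the difference in accuracy between two regions; the function name, the 0/1 input encoding, and the default resample count are assumptions made here, not details from the paper.

```python
import random
from typing import Sequence, Tuple


def bootstrap_gap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap interval for accuracy(A) - accuracy(B).

    `correct_a` and `correct_b` hold 0/1 judge verdicts for two regions.
    Returns (point estimate, lower bound, upper bound).
    """
    if not correct_a or not correct_b:
        raise ValueError("both regions need at least one verdict")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    def mean(xs: Sequence[int]) -> float:
        return sum(xs) / len(xs)

    gaps = []
    for _ in range(n_resamples):
        # Resample each region with replacement, keeping its original size.
        sample_a = rng.choices(correct_a, k=len(correct_a))
        sample_b = rng.choices(correct_b, k=len(correct_b))
        gaps.append(mean(sample_a) - mean(sample_b))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(correct_a) - mean(correct_b), lower, upper
```

An interval that excludes zero would support calling a regional gap substantial; an interval straddling zero would suggest the gap is within sampling noise for the given dataset size.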
Simulated Author's Rebuttal
Thank you for the detailed and constructive referee report. We address each major comment below and indicate the revisions we intend to make to strengthen the manuscript.
- Referee: [LLM-as-Judge framework and Experiments] The headline claims about fragmented signals and cultural performance gaps rest entirely on LLM-as-Judge semantic alignment scores. No calibration data, human inter-annotator agreement, or judge-model ablation is reported, leaving open the possibility that the judge's own training distribution introduces systematic cultural bias that inflates the observed gaps (see the LLM-as-Judge framework description and the Experiments section).
  Authors: We agree that validating the LLM-as-Judge framework is important for supporting the central claims. The manuscript describes the prompting approach and semantic alignment scoring but does not report calibration data, inter-annotator agreement, or model ablations. In the revision we will expand the Experiments section with a new subsection on judge robustness. This will include: discussion of the specific judge model used, consistency checks via repeated prompting on a sample of cases, and explicit acknowledgment of possible cultural biases arising from the judge's training data. We will also add a Limitations paragraph noting that comprehensive human IAA and full ablations were not performed in the current study and constitute valuable future work. The reference annotations themselves were produced by regional cultural experts, and the judge is prompted strictly to measure alignment with those references rather than to perform independent cultural inference.
  revision: partial
- Referee: [Abstract and Results] The abstract states that results show 'substantial performance variation' but supplies neither dataset size, the list of evaluated VLMs, nor any numerical accuracy values. Without these, the magnitude and statistical reliability of the cross-cultural claims cannot be assessed (see Results section and any accompanying tables).
  Authors: We thank the referee for highlighting this presentational issue. While the Results section and tables already contain the dataset size, the full list of evaluated VLMs, and the numerical accuracy values (exact-match, partial-match, and attribute-level) that demonstrate the reported variation, the abstract itself remains high-level. In the revised version we will update the abstract to incorporate these key quantitative details so that readers can immediately gauge the scale and reliability of the cross-cultural performance gaps without first consulting the full paper.
  revision: yes
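The "consistency checks via repeated prompting" promised in the rebuttal could be summarized along the following lines. This is a sketch under stated assumptions: verdicts from repeated judge calls are taken as already collected (no judge API is called here), and the function name and the two summary statistics are choices made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def consistency_rate(repeated_verdicts: Sequence[Sequence[Hashable]]) -> Tuple[float, float]:
    """Summarize judge self-consistency across repeated prompting.

    `repeated_verdicts[i]` holds the verdicts the judge gave for case i over
    several identical calls. Returns:
      - the fraction of cases where every run produced the same verdict
      - the mean share of runs that agree with the per-case majority verdict
    """
    if not repeated_verdicts or any(len(runs) == 0 for runs in repeated_verdicts):
        raise ValueError("every case needs at least one verdict")
    unanimous = 0
    majority_share_sum = 0.0
    for runs in repeated_verdicts:
        top_count = Counter(runs).most_common(1)[0][1]
        unanimous += int(top_count == len(runs))
        majority_share_sum += top_count / len(runs)
    n_cases = len(repeated_verdicts)
    return unanimous / n_cases, majority_share_sum / n_cases
```

High self-consistency would rule out sampling noise in the judge but would not by itself address cultural bias, since a judge can be consistently skewed; that is why the human comparison remains the stronger check.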
Circularity Check
No significant circularity: empirical benchmark with external annotations
The paper introduces a cross-cultural benchmark for structured metadata inference and evaluates VLMs via LLM-as-Judge against reference annotations using exact-match, partial-match, and attribute-level accuracy. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on direct comparison to external human annotations rather than reducing to inputs by construction. Self-citations, if present, are not load-bearing for the central results. This is a standard empirical evaluation setup with no circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Reference annotations constitute accurate ground truth for cultural metadata attributes.
- domain assumption: LLM-as-Judge produces unbiased semantic alignment scores.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We introduce a multi-category, cross-cultural benchmark... evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.