Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames

Delfina Sol Martinez Pandiani; Valentina Presutti

arXiv: 2110.07420 · v1 · submitted 2021-10-14 · cs.CV · cs.CL · cs.DL · cs.SI

Pith reviewed 2026-05-24 13:30 UTC · model grok-4.3

classification cs.CV · cs.CL · cs.DL · cs.SI

keywords social concepts · multimodal frames · art images · ontology · semantic gap · cultural heritage · computer vision · cognitive theories

The pith

Social concepts in art images are modeled as multimodal frames by integrating multisensory data via a new ontology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper translates recent cognitive theories of social concept representation into a software method that represents concepts such as revolution or friendship as multimodal frames. This is achieved by extracting and combining features from multiple sensory modalities in art images tagged with the target concepts, then formalizing the result through a conceptual model and a novel ontology. The approach is tested as a proof of concept on images from the Tate Gallery collection. A sympathetic reader would care because it targets the semantic gap that makes automatic detection of abstract, non-physical concepts difficult in growing visual cultural heritage collections.

Core claim

Social concepts referring to non-physical objects can be automatically modeled by representing them as multimodal frames through the integration of multisensory data extracted from visual art material tagged with the concepts of interest, supported by a defined conceptual model and a novel ontology for formal representation, as demonstrated on the Tate Gallery collection.

What carries the argument

Multimodal frames, which integrate multisensory data to formally represent social concepts inside a novel ontology.
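A minimal sketch of what such a frame could look like as a data structure, assuming each tagged image has already been reduced to lists of detected objects, actions, and colors. The class name, field names, and toy records are hypothetical illustrations, not the paper's released software or the Tate data.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MultimodalFrame:
    """Aggregated features for one social concept (illustrative structure only)."""
    concept: str
    objects: Counter = field(default_factory=Counter)
    actions: Counter = field(default_factory=Counter)
    colors: Counter = field(default_factory=Counter)
    image_count: int = 0


def build_frames(images):
    """Group per-image features by the social concepts each image is tagged with."""
    frames = {}
    for image in images:
        for concept in image["tags"]:
            frame = frames.setdefault(concept, MultimodalFrame(concept))
            frame.objects.update(image["objects"])
            frame.actions.update(image["actions"])
            frame.colors.update(image["colors"])
            frame.image_count += 1
    return frames


# Toy, hand-written records (assumption: real features would come from detectors).
toy_images = [
    {"id": "img1", "tags": ["revolution"], "objects": ["flag", "crowd"],
     "actions": ["marching"], "colors": ["red"]},
    {"id": "img2", "tags": ["revolution", "violence"], "objects": ["flag", "rifle"],
     "actions": ["fighting"], "colors": ["red", "black"]},
    {"id": "img3", "tags": ["friendship"], "objects": ["table"],
     "actions": ["talking"], "colors": ["yellow"]},
]

frames = build_frames(toy_images)
print(frames["revolution"].objects.most_common(1))  # [('flag', 2)]
```

Counting co-occurrences is only one possible way to integrate the modalities; the paper's own integration step may differ.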

If this is right

Art image collections become indexable and queryable by social concepts rather than only concrete visual elements.
The ontology supplies a formal structure for encoding social concepts as multimodal frames.
The method supplies a concrete computational translation of cognitive theories into image analysis pipelines.
Proof-of-concept results on the Tate collection indicate the approach is feasible for cultural heritage applications.
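To make the first consequence concrete, here is a small sketch of an index keyed by social concept, assuming concept tags are already attached to each image. The function names and image IDs are hypothetical, and this is a plain inverted index rather than anything described in the paper.

```python
from collections import defaultdict


def build_concept_index(tagged_images):
    """Map each social concept (lowercased) to the set of image IDs tagged with it."""
    index = defaultdict(set)
    for image_id, concepts in tagged_images.items():
        for concept in concepts:
            index[concept.lower()].add(image_id)
    return index


def query(index, *concepts):
    """Return sorted IDs of images tagged with every requested concept."""
    sets = [index.get(c.lower(), set()) for c in concepts]
    return sorted(set.intersection(*sets)) if sets else []


# Toy tags, written by hand for illustration.
toy_tags = {
    "img1": ["revolution"],
    "img2": ["revolution", "violence"],
    "img3": ["friendship"],
}
index = build_concept_index(toy_tags)
print(query(index, "revolution"))              # ['img1', 'img2']
print(query(index, "Revolution", "violence"))  # ['img2']
```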

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same frame-based representation could be tested on non-art images such as news photos or social media to check domain transfer.
The ontology might link to existing formal concept representations in knowledge graphs for richer querying.
Scaling the method to larger unlabeled corpora would require unsupervised ways to assign initial concept tags.

Load-bearing premise

That integrating multisensory data extracted from images tagged with social concepts will successfully bridge the semantic gap for non-physical social concepts.

What would settle it

A controlled comparison on the same tagged art images showing that multimodal frames yield no measurable improvement in representing or detecting social concepts over standard visual features alone would falsify the claim.

Original abstract

Social concepts referring to non-physical objects--such as revolution, violence, or friendship--are powerful tools to describe, index, and query the content of visual data, including ever-growing collections of art images from the Cultural Heritage (CH) field. While much progress has been made towards complete image understanding in computer vision, automatic detection of social concepts evoked by images is still a challenge. This is partly due to the well-known semantic gap problem, worsened for social concepts given their lack of unique physical features, and reliance on more unspecific features than concrete concepts. In this paper, we propose the translation of recent cognitive theories about social concept representation into a software approach to represent them as multimodal frames, by integrating multisensory data. Our method focuses on the extraction, analysis, and integration of multimodal features from visual art material tagged with the concepts of interest. We define a conceptual model and present a novel ontology for formally representing social concepts as multimodal frames. Taking the Tate Gallery's collection as an empirical basis, we experiment our method on a corpus of art images to provide a proof of concept of its potential. We discuss further directions of research, and provide all software, data sources, and results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main deliverable is a new ontology for social concepts in art as multimodal frames plus open code on Tate images, but the validation stays at proof-of-concept level.

The paper translates cognitive theories on social concepts into a formal ontology of multimodal frames and runs a proof-of-concept extraction pipeline on tagged Tate art images. It also ships the full software, data sources, and results openly. That combination is the useful part: the ontology gives a structured way to represent non-physical concepts like revolution or friendship, and the open release lets others test or extend the pipeline without starting from scratch. The focus on multisensory feature integration for concepts that lack clear visual anchors is a reasonable response to the semantic gap problem in cultural heritage collections. The work stays grounded in the stated motivation and does not overclaim empirical wins.

The main limitation is that the description provides no quantitative metrics, error analysis, or baseline comparisons, so it is impossible to judge whether the multimodal integration actually improves detection over simpler approaches. The central assumption, that pulling together visual, textual, and other signals will reliably evoke the target social concepts, remains untested in the reported results.

This paper is aimed at researchers building annotation tools or retrieval systems for abstract concepts in art and cultural heritage data. The open artifacts make it worth a referee's time even though the empirical section is preliminary; a serious review could clarify what the ontology adds beyond existing multimodal or frame-based representations and whether the pipeline scales.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes translating cognitive theories of social concept representation into a computational approach that models them as multimodal frames via integration of multisensory data extracted from tagged art images. It defines a conceptual model, introduces a novel ontology for formal representation of social concepts as multimodal frames, and reports a proof-of-concept experiment on the Tate Gallery collection, with all software, data, and results released openly.

Significance. If the ontology and integration pipeline hold, the work could help close the semantic gap for non-physical social concepts in visual collections, supporting improved indexing and retrieval in cultural heritage applications. The explicit release of software, data sources, and results is a clear strength that supports reproducibility and extension by others.

minor comments (2)

The abstract states that the method 'focuses on the extraction, analysis, and integration of multimodal features' but does not name the specific social concepts or feature types used in the Tate experiment; adding one concrete example would improve clarity without altering the central claim.
The paper frames the contribution as a conceptual model plus open resources rather than a performance benchmark; the discussion section should explicitly state the scope of the proof-of-concept to avoid reader expectations of quantitative metrics.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper frames its contribution as a proposal to translate existing cognitive theories into a new conceptual model and ontology for multimodal frames, followed by a proof-of-concept extraction pipeline on pre-tagged Tate images. No equations, parameter fitting, performance predictions, or derivation chains appear in the provided text. The central steps are definitional (introducing a model and ontology) rather than reductions of outputs to inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked to force the result. This is a standard non-circular conceptual proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract invokes cognitive theories as background and introduces a new ontology without specifying numerical parameters or additional entities.

axioms (1)

domain assumption: Social concepts lack unique physical features and rely on unspecific features, worsening the semantic gap.
Stated in the motivation section of the abstract as the core challenge.

invented entities (1)

multimodal frames ontology (no independent evidence)
purpose: Formally represent social concepts evoked by art images
Described as a novel ontology presented in the paper.
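As a rough illustration of what formally representing a frame could involve, the sketch below serializes one frame as subject-predicate-object triples in N-Triples syntax. The namespace and every property name are hypothetical placeholders under example.org; they are not taken from the paper's ontology.

```python
# Hypothetical namespace for illustration; not the paper's ontology IRI.
EX = "http://example.org/frames#"


def frame_to_triples(concept, features):
    """Yield (subject, predicate, object) IRIs for one frame.

    `features` maps a modality name (e.g. "object", "color") to feature labels.
    """
    frame = f"{EX}{concept}Frame"
    yield (frame, f"{EX}representsConcept", f"{EX}{concept}")
    for modality, labels in sorted(features.items()):
        for label in sorted(labels):
            # One made-up property per modality, e.g. hasColorFeature.
            yield (frame, f"{EX}has{modality.capitalize()}Feature", f"{EX}{label}")


def to_ntriples(triples):
    """Render triples of IRIs as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = list(frame_to_triples("revolution", {"object": ["flag"], "color": ["red"]}))
print(to_ntriples(triples))
```

A real encoding would reuse the classes and properties defined by the published ontology instead of these invented names.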

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We define a conceptual model and present a novel ontology for formally representing social concepts as multimodal frames... MUSCO ontology, based on the Descriptions and Situations (DnS) ontology"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "extraction, analysis, and integration of multimodal features (including depicted concrete objects, depicted actions, and color features) from images tagged with social concepts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

DOLCE+DnS Ultralite (DUL) ontology

DOLCE+DnS Ultralite (DUL) ontology. Last accessed: 2021-08-05. URL: http://www.loa.istc.cnr.it/ontologies/DUL.owl

work page 2021
[2]

Integrating knowledge and reasoning in image understanding

Somak Aditya, Yezhou Yang, and Chitta Baral. Integrating knowledge and reasoning in image understanding. In International Joint Conference on Artificial Intelligence (IJCAI), 2019

work page 2019
[3]

Abstract concept & emotion detection in tagged images with cnns

Youssef Ahres and Nikolaus Volk. Abstract concept & emotion detection in tagged images with cnns. 2016. URL: http://cs231n.stanford.edu/reports/2016/pdfs/008_Report.pdf

work page 2016
[4]

Distant viewing: analyzing large visual corpora

Taylor Arnold and Lauren Tilton. Distant viewing: analyzing large visual corpora. Digital Scholarship in the Humanities, 34(Supplement 1):i3-i16, 2019

work page 2019
[5]

Deep learning architectures for computer vision applications: a study

Randheer Bagi, Tanima Dutta, and Hari Prabhat Gupta. Deep learning architectures for computer vision applications: a study. In Advances in data and information sciences, pages 601-612. Springer, 2020

work page 2020
[6]

Camera Lucida: Reflections on Photography

Roland Barthes. Camera Lucida: Reflections on Photography. New York: Hill and Wang, 1981

work page 1981
[7]

Varieties of abstract concepts: development, use and representation in the brain

Anna M Borghi, Laura Barca, Ferdinand Binkofski, and Luca Tummolini. Varieties of abstract concepts: development, use and representation in the brain. Philosophical Transactions of the Royal Society, 2018

work page 2018
[8]

DeepSentiBank: Visual Sentiment Concept Classification with Deep Convolutional Neural Networks

Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. arXiv preprint arXiv:1410.8586, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[9]

The art of detection

Elliot J Crowley and Andrew Zisserman. The art of detection. In European conference on computer vision, pages 721-737. Springer, 2016

work page 2016
[10]

Building semantic memory from embodied and distributional language experience

Charles P. Davis and Eiling Yee. Building semantic memory from embodied and distributional language experience. WIREs Cognitive Science, e1555, 2021

work page 2021
[11]

A comparative approach between different computer vision tools, including commercial and open-source, for improving cultural image access and analysis

José Luis Preza Díaz, Amelie Dorn, Gerda Koch, and Yalemisew Abgaz. A comparative approach between different computer vision tools, including commercial and open-source, for improving cultural image access and analysis. In 2020 10th International Conference on Advanced Computer Information Technologies (ACIT), pages 815-819. IEEE, 2020

work page 2020
[12]

Understanding the semantic web through descriptions and situations

Aldo Gangemi and Peter Mika. Understanding the semantic web through descriptions and situations. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 689-706. Springer, 2003

work page 2003
[13]

Concepts, control, and context: A connectionist account of normal and disordered semantic cognition

Paul Hoffman, James L McClelland, and Matthew A Lambon Ralph. Concepts, control, and context: A connectionist account of normal and disordered semantic cognition. Psychological review, 125(3):293, 2018

work page 2018
[14]

Improving object detection in art images using only style transfer

David Kadish, Sebastian Risi, and Anders Sundnes Løvlie. Improving object detection in art images using only style transfer. arXiv preprint arXiv:2102.06529, 2021

work page arXiv 2021
[15]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32-73, 2017

work page 2017
[16]

Social roles and their descriptions

Claudio Masolo, Laure Vieu, Emanuele Bottazzi, Carola Catenacci, Roberta Ferrario, Aldo Gangemi, and Nicola Guarino. Social roles and their descriptions. In KR, pages 267-277, 2004

work page 2004
[17]

Representation of concepts as frames

Wiebke Petersen. Representation of concepts as frames. Düsseldorf University, 2015. Reprint

work page 2015
[18]

Indexing multimedia and creative works: the problems of meaning and interpretation

Pauline Rafferty and Rob Hidderley. Indexing multimedia and creative works: the problems of meaning and interpretation. Routledge, 2017

work page 2017
[19]

Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals

Nusrat J Shoumy, Li-Minn Ang, Kah Phooi Seng, DM Motiur Rahaman, and Tanveer Zia. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. Journal of Network and Computer Applications, 149:102447, 2020

work page 2020
[20]

Evaluation of deep learning on an abstract image classification dataset

Sebastian Stabinger and Antonio Rodriguez-Sanchez. Evaluation of deep learning on an abstract image classification dataset. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2767-2772, 2017

work page 2017
[21]

Describing low-level image features using the comm ontology

Miroslav Vacura, Vojtech Svátek, Carsten Saathoff, Thomas Franz, and Raphaël Troncy. Describing low-level image features using the comm ontology. In 2008 15th IEEE International Conference on Image Processing, pages 49-52. IEEE, 2008

work page 2008
