Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

Delfina Sol Martinez Pandiani; Nicolas Lazzari; Valentina Presutti

arxiv: 2402.19339 · v1 · submitted 2024-02-29 · 💻 cs.CV · cs.AI

Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

Delfina Sol Martinez Pandiani , Nicolas Lazzari , Valentina Presutti This is my paper

Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords abstract concept classificationknowledge graph embeddingsvision transformersneuro-symbolic methodssituated perceptual knowledgeimage understandinghybrid modelsinterpretability

0 comments

The pith

Fusing knowledge graph embeddings of situated perceptual knowledge with Vision Transformer features improves abstract concept image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that automatically extracting perceptual semantic units from cultural images, modeling them in the ARTstract Knowledge Graph, and fusing the resulting embeddings with visual transformer features produces better performance than existing methods on abstract concept classification. A sympathetic reader would care because standard deep vision models excel at low-level pixel patterns but often miss the context-dependent, semantic understanding humans bring to high-level image interpretation. The work shows complementarity: vision transformers handle sensory attributes while the knowledge graph component represents more abstract scene elements. This hybrid approach is presented as evidence that neuro-symbolic integration can address gaps in current visual comprehension systems for downstream tasks.

Core claim

Hybrid KGE-ViT methods that combine embeddings from the ARTstract Knowledge Graph (built from over 14,000 labeled cultural images and enriched with linguistic frames) with Vision Transformer embeddings outperform existing techniques on abstract concept image classification; posthoc analyses indicate that the visual transformer captures pixel-level attributes while the fused method better represents abstract and semantic scene elements, revealing synergy between situated perceptual knowledge in the KGE and sensory-perceptual understanding in the deep model.

What carries the argument

The ARTstract Knowledge Graph (AKG) that encodes automatically extracted perceptual semantic units and high-level linguistic frames, whose embeddings are fused with Vision Transformer embeddings via relative representations and hybrid approaches.

If this is right

The hybrid methods achieve higher accuracy than existing techniques specifically on abstract concept image classification tasks.
Posthoc interpretability shows the visual transformer focuses on pixel-level visual attributes while the KGE component handles more abstract and semantic scene elements.
The demonstrated synergy between KGE embeddings and ViT features supports the use of neuro-symbolic methods for knowledge integration in visual representation.
The approach suggests potential for improved performance on downstream intricate visual comprehension tasks that require both sensory and situated knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the AKG extraction process generalizes beyond the cultural image dataset, the same fusion technique could be applied to other high-level vision domains such as scene understanding in video or medical imaging.
The complementarity finding implies that pure scaling of vision transformers may hit limits on tasks requiring explicit semantic context, pointing toward systematic testing of KG fusion on distribution-shift benchmarks.
One testable extension is whether the relative representation method used for fusion reduces the need for large amounts of labeled data compared to end-to-end fine-tuning alone.

Load-bearing premise

The automatically extracted perceptual semantic units and resulting ARTstract Knowledge Graph accurately encode situated, context-dependent human knowledge of abstract concepts such that embedding and fusing them with ViT features yields genuine generalization rather than dataset-specific fitting.

What would settle it

A test in which the knowledge graph embeddings are replaced by embeddings from a randomly constructed graph with the same structure; if the hybrid method no longer outperforms the pure ViT baseline on the same test set, the contribution of the situated perceptual knowledge would be falsified.

Figures

Figures reproduced from arXiv: 2402.19339 by Delfina Sol Martinez Pandiani, Nicolas Lazzari, Valentina Presutti.

**Figure 1.** Figure 1: Subset of the A-Box of ARTstract-KG, showing the types of commonsense linguistic knowledge connected to a single image instance. Most annotations are typed by ConceptNet concepts, while the image captions are typed by WordNet concepts as well as by linguistic frames. 3.2. ARTstract Knowledge Graph Creation We use the SituAnnotate ontology [53], which models the situated assignment of annotation labels to … view at source ↗

**Figure 2.** Figure 2: Macro F1 scores on the AC image classification tasks for different input embeddings. Absolute versus Relative Embeddings RelKGE outperformed absKGE, achieving a higher Macro F1 score of 0.27 compared to 0.22 (see [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Absolute ViT vs. Absolute KGE embeddings capture different aspects of ARTstract images. Top: Absolute ViT captures aspects that resemble the United States flag while KGE captures more landscape-related features, Bottom: Absolute KGE demonstrates superior semantic performance than ViT by encoding similarities with perceptually diverse visions of the Statue of Liberty [PITH_FULL_IMAGE:figures/full_fig_p010… view at source ↗

**Figure 4.** Figure 4: Contrasting semantic proficiency of Absolute KGE vs. Absolute ViT. The top image illustrates ViT’s focus on colors and textures (aesthetics), whereas KGE excels in recognizing explicit semantics, particularly women sitting on couches. In the bottom image, KGE effectively encodes the semantics of reading a book in the test artwork. Multiple test instances suggest that the KGE method exhibits superior perfor… view at source ↗

**Figure 5.** Figure 5: ViT misclassifies as death, but KGE successfully associates images with crosses to the concept of comfort, indicating ViT’s focus on colors and textures [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: ViT misclassifies as comfort, but KGE successfully associates images with crosses to the concept of death. comfort, likely due to the original image’s warm colors, landscape composition and drawing/cartoon-like drawing features, the top similar images as based on vit feature outdoor scenes irrelevant to the ground truth of death. In contrast, the top three similar images based on KGE embeddings share the c… view at source ↗

**Figure 7.** Figure 7: Interpretability results for a test image labeled as fitness. Top similar anchors are shown for the test instance using relative ViT embeddings (top row), relative KGE embeddings (middle rows), and hybrid embeddings. Shared ARTstract-KG nodes accompany each row. The hybrid embedding integrates complementary information from both relative embeddings to prioritize anchors tagged as fitness. These findings h… view at source ↗

read the original abstract

The increasing demand for automatic high-level image understanding, particularly in detecting abstract concepts (AC) within images, underscores the necessity for innovative and more interpretable approaches. These approaches need to harmonize traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at intricate semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute KG embeddings and experiment with relative representations and hybrid approaches that fuse these embeddings with visual transformer embeddings. Finally, for interpretability, we conduct posthoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The posthoc interpretability analyses reveal the visual transformer's proficiency in capturing pixel-level visual attributes, contrasting with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between KGE embeddings' situated perceptual knowledge and deep visual model's sensory-perceptual understanding for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation for use in downstream intricate visual comprehension tasks. All the materials and code are available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hybrid KGE-ViT fusion on a new cultural AKG is a standard neuro-symbolic move but the abstract gives zero metrics or extraction validation, so the synergy claim rests on untested assumptions.

read the letter

The paper's core move is to build the ARTstract Knowledge Graph from perceptual semantic units automatically pulled out of 14k cultural images, enrich it with linguistic frames, embed it, and fuse those embeddings with ViT features for abstract-concept classification. They also run post-hoc similarity checks for interpretability. That is the actual new piece: the specific resource and the hybrid setup applied to this domain. The complementarity angle (ViT on pixels, KG on semantics) is a familiar neuro-symbolic pattern, but executing it on situated cultural images is a legitimate next step and the open code is useful. The motivation for better high-level semantic handling is clear and the qualitative analysis direction makes sense if the numbers back it up. The main weakness is exactly what the stress-test flags. The abstract states that the hybrid models outperform baselines and that the KG supplies genuine situated knowledge, yet it reports no accuracy figures, no baseline tables, no ablations, and no human validation or inter-rater scores on the automatic extraction step. Without those, any lift could be an artifact of how the units were pulled or of simply adding extra embeddings rather than real complementarity. The extraction process itself is described at a high level but never checked against an independent gold standard or tested for sensitivity to extraction choices. This is not a minor omission; it sits at the center of the argument. The work is aimed at researchers already working on neuro-symbolic vision or semantic image tasks in cultural domains. A reader in that niche could extract the fusion recipe and the resource construction details even if the performance claims need heavy qualification. It is coherent enough on its own terms to warrant a serious referee rather than an immediate desk reject, provided the full manuscript supplies the missing quantitative controls and validation steps. I would send it out for review with explicit instructions to the referees to check the extraction quality and the statistical support for the claimed gains.

Referee Report

3 major / 1 minor

Summary. The paper introduces a neuro-symbolic approach for classifying abstract concepts (ACs) in cultural images. It automatically extracts perceptual semantic units from over 14,000 labeled images to construct the ARTstract Knowledge Graph (AKG), augments it with linguistic frames, computes KG embeddings, and fuses them with Vision Transformer (ViT) features via relative representations and hybrid models. The central claims are that the hybrid KGE-ViT methods outperform prior techniques on AC classification and that post-hoc analyses demonstrate complementarity, with KGE capturing abstract semantic elements and ViT handling pixel-level attributes.

Significance. If the empirical claims hold after addressing validation gaps, the work would contribute to neuro-symbolic computer vision by showing how situated perceptual knowledge from a domain-specific KG can complement sensory features from transformers for high-level semantic tasks. The public release of materials and code is a positive factor that supports reproducibility.

major comments (3)

[Methods (AKG construction and embedding)] Methods section describing AKG construction: The automatic extraction of perceptual semantic units and their modeling into the ARTstract Knowledge Graph is presented without any human validation, inter-annotator agreement scores, ablation on extraction parameters, or comparison to an independent gold standard. This is load-bearing for the claim that the AKG encodes 'situated perceptual knowledge' rather than extraction artifacts or dataset biases, directly affecting the interpretation of any performance gains from fusion.
[Results and Experiments] Results section: The assertion that hybrid KGE-ViT methods outperform existing techniques is not accompanied by the specific quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies needed to evaluate the central empirical claim. Without these, it is impossible to determine whether observed improvements arise from genuine complementarity or from auxiliary embedding effects.
[Interpretability analyses] Interpretability analyses: The post-hoc qualitative comparison of model similarities with training instances is used to contrast ViT's pixel-level focus with the method's semantic focus, but no quantitative measures (e.g., similarity score distributions or controlled examples) are supplied to substantiate the claimed complementarity.

minor comments (1)

[Abstract] The abstract and introduction use 'ACs' without an initial expansion on first use.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve clarity, rigor, and completeness.

read point-by-point responses

Referee: [Methods (AKG construction and embedding)] Methods section describing AKG construction: The automatic extraction of perceptual semantic units and their modeling into the ARTstract Knowledge Graph is presented without any human validation, inter-annotator agreement scores, ablation on extraction parameters, or comparison to an independent gold standard. This is load-bearing for the claim that the AKG encodes 'situated perceptual knowledge' rather than extraction artifacts or dataset biases, directly affecting the interpretation of any performance gains from fusion.

Authors: We agree that additional validation details would strengthen the presentation. In the revision we will expand the methods section with an explicit description of extraction parameters, an ablation study varying those parameters, and a discussion of potential dataset biases. We will also include a small-scale comparison against a manually reviewed subset of extracted units to provide an independent check, while noting that a full inter-annotator agreement study was outside the original scope. revision: yes
Referee: [Results and Experiments] Results section: The assertion that hybrid KGE-ViT methods outperform existing techniques is not accompanied by the specific quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies needed to evaluate the central empirical claim. Without these, it is impossible to determine whether observed improvements arise from genuine complementarity or from auxiliary embedding effects.

Authors: The results section reports performance numbers and baseline comparisons, yet we acknowledge that statistical tests and additional ablations would make the evidence more robust. We will revise the section to tabulate all quantitative metrics explicitly, add paired statistical significance tests, and include further ablation experiments isolating the contribution of the KGE component versus embedding dimensionality effects. revision: yes
Referee: [Interpretability analyses] Interpretability analyses: The post-hoc qualitative comparison of model similarities with training instances is used to contrast ViT's pixel-level focus with the method's semantic focus, but no quantitative measures (e.g., similarity score distributions or controlled examples) are supplied to substantiate the claimed complementarity.

Authors: We will augment the interpretability analyses with quantitative support, specifically by reporting distributions of similarity scores for each model type and by adding controlled example pairs with numerical similarity values to demonstrate the differing focus of KGE versus ViT representations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical fusion evaluated on held-out data

full rationale

The paper describes an empirical pipeline: automatic extraction of perceptual units from labeled images to construct the ARTstract KG, computation of KG embeddings, fusion with ViT features, and accuracy comparison against baselines. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that would make the hybrid performance result equivalent to its inputs by construction. The central claim rests on standard train/test splits and external resource construction rather than internal re-derivation, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only view supplies minimal technical detail; the central claim depends on the unverified quality of the perceptual unit extraction process and the assumption that KG embeddings meaningfully encode situated knowledge.

axioms (1)

domain assumption Automatically extracted perceptual semantic units from images can be modeled to capture situated perceptual semantics of abstract concepts.
Invoked when constructing the AKG and using its embeddings for classification.

invented entities (1)

ARTstract Knowledge Graph (AKG) no independent evidence
purpose: Captures situated perceptual semantics from over 14,000 cultural images labeled with abstract concepts, augmented with linguistic frames.
Newly constructed resource described in the abstract.

pith-pipeline@v0.9.0 · 5813 in / 1472 out tokens · 44446 ms · 2026-05-24T03:29:20.736367+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 2 internal anchors

[1]

Derf: Decomposed radiance fields,

Panos Achlioptas et al. “ArtEmis: Affective Language for Visual Art”. en. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, vir- tual, June 19-25, 2021. Nashville, TN, USA: Computer Vision Foundation / IEEE, 2021, pp. 11569–11579. DOI: 10.1109/CVPR46437.2021.01140 . (Visited on 02/09/2022)

work page doi:10.1109/cvpr46437.2021.01140 2021
[2]

Explicit reasoning over end-to- end neural architectures for visual question answering

Somak Aditya, Yezhou Yang, and Chitta Baral. “Explicit reasoning over end-to- end neural architectures for visual question answering”. In: Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 32. 2018

work page 2018
[3]

Integrating knowledge and rea- soning in image understanding

Somak Aditya, Yezhou Yang, and Chitta Baral. “Integrating knowledge and rea- soning in image understanding”. In: 28th International Joint Conference on Artifi- cial Intelligence, IJCAI 2019. International Joint Conferences on Artificial Intelli- gence. 2019, pp. 6252–6259

work page 2019
[4]

A public domain dataset for human activity recognition using smartphones

Davide Anguita et al. “A public domain dataset for human activity recognition using smartphones.” In: Esann. V ol. 3. 2013, p. 3

work page 2013
[5]

Distant Viewing Toolkit: A Python Package for the Analysis of Visual Culture

Taylor Arnold and Lauren Tilton. “Distant Viewing Toolkit: A Python Package for the Analysis of Visual Culture”. In: Journal of Open Source Software 5.45 (Jan. 2020), p. 1800. ISSN : 2475-9066. DOI: 10.21105/joss.01800. (Visited on 12/13/2021)

work page doi:10.21105/joss.01800 2020
[6]

Modular Design Patterns for Hybrid Learning and Reasoning Systems: a taxonomy, patterns and use cases

Michael van Bekkum et al. “Modular Design Patterns for Hybrid Learning and Reasoning Systems: a taxonomy, patterns and use cases”. In: arXiv:2102.11965 [cs] 51.9 (Mar. 2021), pp. 6528–6546. (Visited on 01/20/2022)

work page arXiv 2021
[7]

A Survey on Word Meta-Embedding Learning

Danushka Bollegala and James O’Neill. “A Survey on Word Meta-Embedding Learning”. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022 . Ed. by Luc De Raedt. ijcai.org, 2022, pp. 5402–5409. DOI: 10.24963/IJCAI.2022/758

work page doi:10.24963/ijcai.2022/758 2022
[8]

Translating embeddings for modeling multi-relational data

Antoine Bordes et al. “Translating embeddings for modeling multi-relational data”. In: Advances in neural information processing systems 26 (2013). 16 Martinez Pandiani et al. 2024 (Preprint) /

work page 2013
[9]

Negative results in computer vision: A perspective

Ali Borji. “Negative results in computer vision: A perspective”. In: Image and Vision Computing 69 (2018), pp. 1–8

work page 2018
[10]

Culture and human development: A new look

Jerome Bruner. “Culture and human development: A new look”. In: Human devel- opment 33.6 (1990), pp. 344–355

work page 1990
[11]

Scalable Theory-Driven Regularization of Scene Graph Generation Models

Davide Buffelli and Efthymia Tsamoura. “Scalable Theory-Driven Regularization of Scene Graph Generation Models”. In:Thirty-Seventh AAAI Conference on Arti- ficial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Ad- vances in Artificial Intelligence, EAAI...

work page doi:10.1609/aaai.v37i6.25839 2023
[12]

End-to-end object detection with transformers

Nicolas Carion et al. “End-to-end object detection with transformers”. In: Euro- pean conference on computer vision. Springer. 2020, pp. 213–229

work page 2020
[13]

Iterative visual reasoning beyond convolutions

Xinlei Chen et al. “Iterative visual reasoning beyond convolutions”. In: Proc. of CVPR 2018. IEEE. 2018, pp. 7239–7248

work page 2018
[14]

Computers in Human Behavior68, 83–95 (2017) https://doi.org/10.1016/j.chb

F. Ciroku et al. “Automated multimodal sensemaking: Ontology-based integra- tion of linguistic frames and visual data”. In: Computers in Human Behavior 150 (2024), p. 107997. ISSN : 0747-5632. DOI: https://doi.org/10.1016/j.chb. 2023.107997

work page doi:10.1016/j.chb 2024
[15]

The different frameworks underlying abstract and concrete knowledge: Evidence from a bilin- gual patient with a semantic refractory access dysphasia

Sebastian J Crutch, Basil H Ridha, and Elizabeth K Warrington. “The different frameworks underlying abstract and concrete knowledge: Evidence from a bilin- gual patient with a semantic refractory access dysphasia”. In: Neurocase 12.3 (2006), pp. 151–163

work page 2006
[16]

Applying fuzzy DLs in the extraction of image semantics

Stamatia Dasiopoulou, Ioannis Kompatsiaris, and Michael G Strintzis. “Applying fuzzy DLs in the extraction of image semantics”. In: Journal on data semantics XIV. Springer, 2009, pp. 105–132

work page 2009
[17]

Qualitative differences in the representation of ab- stract versus concrete words: Evidence from the visual-world paradigm

Jon Andoni Du ˜nabeitia et al. “Qualitative differences in the representation of ab- stract versus concrete words: Evidence from the visual-world paradigm”. In:Cog- nition 110.2 (2009), pp. 284–292

work page 2009
[18]

Multimodal learning with graphs

Yasha Ektefaie et al. “Multimodal learning with graphs”. In: Nat. Mac. Intell. 5.4 (2023), pp. 340–350. DOI: 10.1038/S42256-023-00624-6

work page doi:10.1038/s42256-023-00624-6 2023
[19]

Cognition does not affect perception: Evalu- ating the evidence for “top-down

Chaz Firestone and Brian J Scholl. “Cognition does not affect perception: Evalu- ating the evidence for “top-down” effects”. In: Behavioral and brain sciences 39 (2016)

work page 2016
[20]

N., Abdrasheva, G

Aldo Gangemi et al. “Framester: A wide coverage linguistic linked data hub”. en. In: European Knowledge Acquisition Workshop. Ed. by Eva Blomqvist et al. Lec- ture Notes in Computer Science. Springer. Cham: Springer International Publish- ing, 2016, pp. 239–254. ISBN : 978-3-319-49004-5. DOI: 10.1007/978-3-319- 49004-5\_16

work page doi:10.1007/978-3-319- 2016
[21]

An End-To-End Network for Gen- erating Social Relationship Graphs

Arushi Goel, Keng Teck Ma, and Cheston Tan. “An End-To-End Network for Gen- erating Social Relationship Graphs”. In:2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, June 2019, pp. 11178–11187. ISBN : 978-1-72813-293-8. DOI: 10.1109/CVPR.2019.01144. Martinez Pandiani et al. 2024 (Preprint) / 17

work page doi:10.1109/cvpr.2019.01144 2019
[22]

In: Proc

Douglas Gray et al. “Predicting Facial Beauty without Landmarks”. In: Com- puter Vision – ECCV 2010. Ed. by Kostas Daniilidis, Petros Maragos, and Nikos Paragios. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2010, pp. 434–447. ISBN : 978-3-642-15567-3. DOI: 10.1007/978- 3- 642- 15567- 3\_32

work page doi:10.1007/978- 2010
[23]

Detecting Persuasive Atypi- cality by Modeling Contextual Compatibility

Meiqi Guo, Rebecca Hwa, and Adriana Kovashka. “Detecting Persuasive Atypi- cality by Modeling Contextual Compatibility”. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, Oct. 2021, pp. 952–962. ISBN : 978-1-66542-812-5. DOI: 10 . 1109 / ICCV48922 . 2021 . 00101. (Visited on 03/03/2022)

work page 2021
[24]

Deep multimodal represen- tation learning: A survey

Wenzhong Guo, Jianwen Wang, and Shiping Wang. “Deep multimodal represen- tation learning: A survey”. In: IEEE Access 7 (2019), pp. 63373–63394

work page 2019
[25]

ConceptNet 3: a flexible, mul- tilingual semantic network for common sense knowledge

Catherine Havasi, Robert Speer, and Jason Alonso. “ConceptNet 3: a flexible, mul- tilingual semantic network for common sense knowledge”. In:Recent advances in natural language processing. John Benjamins Philadelphia, PA. 2007, pp. 27–29

work page 2007
[26]

Deep residual learning for image recognition,

Kaiming He et al. “Deep Residual Learning for Image Recognition”. en. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Las Ve- gas, NV , USA: IEEE, June 2016, pp. 770–778. ISBN : 978-1-4673-8851-1. DOI: 10.1109/CVPR.2016.90. (Visited on 02/15/2022)

work page doi:10.1109/cvpr.2016.90 2016
[27]

Concepts, control, and context: A connectionist account of normal and disordered semantic cognition

Paul Hoffman. “Concepts, control, and context: A connectionist account of normal and disordered semantic cognition.” en. In: Psychological Review 125.3 (2018), p. 293. ISSN : 1939-1471. DOI: 10.1037/rev0000094. (Visited on 12/13/2021)

work page doi:10.1037/rev0000094 2018
[28]

Putting objects in perspec- tive

Derek Hoiem, Alexei A Efros, and Martial Hebert. “Putting objects in perspec- tive”. In: International Journal of Computer Vision 80 (2008), pp. 3–15

work page 2008
[29]

Inferring Visual Persuasion via Body Language, Set- ting, and Deep Features

X. Huang and A. Kovashka. “Inferring Visual Persuasion via Body Language, Set- ting, and Deep Features”. In:IEEE Computer Society Conference on Computer Vi- sion and Pattern Recognition Workshops. 2016, pp. 778–784. ISBN : 978-1-4673- 8850-4. DOI: 10.1109/CVPRW.2016.102

work page doi:10.1109/cvprw.2016.102 2016
[30]

Automatic Understanding of Image and Video Advertise- ments

Zaeem Hussain et al. “Automatic Understanding of Image and Video Advertise- ments”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1705–1715

work page 2017
[31]

Automatic understanding of image and video advertise- ments

Zaeem Hussain et al. “Automatic understanding of image and video advertise- ments”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1705–1715. (Visited on 01/18/2022)

work page 2017
[32]

Discovering states and trans- formations in image collections

Phillip Isola, Joseph J Lim, and Edward H Adelson. “Discovering states and trans- formations in image collections”. In:Proceedings of the IEEE conference on com- puter vision and pattern recognition. 2015, pp. 1383–1391

work page 2015
[33]

A Review on Methods and Applications in Multimodal Deep Learning

Summaira Jabeen et al. “A Review on Methods and Applications in Multimodal Deep Learning”. In: ACM Trans. Multim. Comput. Commun. Appl. 19.2s (2023), 76:1–76:41. DOI: 10.1145/3545572

work page doi:10.1145/3545572 2023
[34]

Derf: Decomposed radiance fields,

Menglin Jia et al. “Intentonomy: a Dataset and Study towards Human Intent Un- derstanding”. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, June 2021, pp. 12981–12991. ISBN : 978-1-66544-509-2. DOI: 10.1109/CVPR46437.2021.01279. (Visited on 02/28/2022). 18 Martinez Pandiani et al. 2024 (Preprint) /

work page doi:10.1109/cvpr46437.2021.01279 2021
[35]

Visual Persuasion: Inferring Communicative Intents of Im- ages

Jungseock Joo et al. “Visual Persuasion: Inferring Communicative Intents of Im- ages”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 216–223. (Visited on 01/18/2022)

work page 2014
[36]

Symbolic image detection using scene and knowledge graphs

Nasrin Kalanat and Adriana Kovashka. “Symbolic image detection using scene and knowledge graphs”. In: arXiv preprint arXiv:2206.04863 (2022)

work page arXiv 2022
[37]

Fairface: Face attribute dataset for bal- anced race, gender, and age for bias measurement and mitigation

Kimmo Karkkainen and Jungseock Joo. “Fairface: Face attribute dataset for bal- anced race, gender, and age for bias measurement and mitigation”. In: Proceed- ings of the IEEE/CVF winter conference on applications of computer vision. 2021, pp. 1548–1558

work page 2021
[38]

The representation of abstract words: Why emo- tion matters

Stavroula-Thaleia Kousta et al. “The representation of abstract words: Why emo- tion matters”. In: Journal of Experimental Psychology: General 140.1 (2011), pp. 14–34. ISSN : 1939-2222. DOI: 10.1037/a0021446

work page doi:10.1037/a0021446 2011
[39]

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. In: arXiv:1602.07332 [cs] 123.1 (Feb. 2016), pp. 32–73. (Visited on 12/14/2021)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[40]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li et al. “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation”. In: International Conference on Machine Learning. PMLR. 2022, pp. 12888–12900

work page 2022
[41]

Dual-Glance Model for Deciphering Social Relationships

Junnan Li et al. “Dual-Glance Model for Deciphering Social Relationships”. In: 2017 IEEE International Conference on Computer Vision (ICCV) . Venice: IEEE, Oct. 2017, pp. 2669–2678. ISBN : 978-1-5386-1032-9. DOI: 10 . 1109 / ICCV . 2017.289

work page 2017
[42]

Situation Recognition with Graph Neural Networks

Ruiyu Li et al. “Situation Recognition with Graph Neural Networks”. In: 2017 IEEE International Conference on Computer Vision (ICCV) . Venice: IEEE, Oct. 2017, pp. 4183–4192. ISBN : 978-1-5386-1032-9. DOI: 10 . 1109 / ICCV . 2017 . 448

work page 2017
[43]

Graph-Based Social Relation Reasoning

Wanhua Li et al. “Graph-Based Social Relation Reasoning”. In: Computer Vi- sion – ECCV 2020. Ed. by Andrea Vedaldi et al. Lecture Notes in Computer Sci- ence. Cham: Springer International Publishing, 2020, pp. 18–34.ISBN : 978-3-030- 58555-6. DOI: 10.1007/978-3-030-58555-6\_2

work page doi:10.1007/978-3-030-58555-6 2020
[44]

GraphAdapter: Tuning Vision-Language Models With Dual Knowl- edge Graph

Xin Li et al. “GraphAdapter: Tuning Vision-Language Models With Dual Knowl- edge Graph”. In: CoRR abs/2309.13625 (2023). DOI: 10.48550/ARXIV.2309. 13625. arXiv: 2309.13625

work page doi:10.48550/arxiv.2309 2023
[45]

The artbench dataset: Benchmarking generative models with artworks

Peiyuan Liao et al. “The artbench dataset: Benchmarking generative models with artworks”. In: arXiv preprint arXiv:2206.11404 (2022)

work page arXiv 2022
[46]

Microsoft coco: Common objects in context

Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014, pp. 740–755

work page 2014
[47]

ConceptNet–a practical commonsense reasoning tool- kit

Hugo Liu and Push Singh. “ConceptNet–a practical commonsense reasoning tool- kit”. In: BT technology journal 22.4 (2004), pp. 211–226

work page 2004
[48]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

Ziwei Liu et al. “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations”. In:Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 1096–1104

work page 2016
[49]

Collective activity detection using hinge-loss Markov random fields

Ben London et al. “Collective activity detection using hinge-loss Markov random fields”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013, pp. 566–571. Martinez Pandiani et al. 2024 (Preprint) / 19

work page 2013
[50]

The More You Know: Using Knowledge Graphs for Image Classification

Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. “The More You Know: Using Knowledge Graphs for Image Classification”. In: 2017 IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp. 20–28.DOI: 10.1109/ CVPR.2017.10

work page 2017
[51]

Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames

D. S. Martinez Pandiani and V . Presutti. “Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames”. In: Proceedings of the Workshops and Tutorials held at LDK 2021 co-located with the 3rd Language, Data and Knowledge Conference (LDK 2021). Zaragoza, Spain, 2021, arXiv–2110

work page 2021
[52]

Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

D. S. Martinez Pandiani and V . Presutti. “Seeing the Intangible: Survey of Im- age Classification into High-Level and Abstract Categories”. In: arXiv preprint arXiv:2308.10562 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Situated Ground Truths: Enhancing Bias- Aware AI by Situating Data Labels with SituAnnotate

D.S. Martinez Pandiani and V . Presutti. “Situated Ground Truths: Enhancing Bias- Aware AI by Situating Data Labels with SituAnnotate”. In: [Under Review] Spe- cial Issue on Trustworthy Artificial Intelligence of ACM Transactions on Knowl- edge Discovery from Data (TKDD) (2024)

work page 2024
[54]

Hypericons for Interpretability: Decoding Abstract Concepts in Visual Data

D.S. Martinez Pandiani et al. “Hypericons for Interpretability: Decoding Abstract Concepts in Visual Data”. In: International Journal of Digital Humanities (IJDH) (2023)

work page 2023
[55]

Relative representations enable zero-shot latent space com- munication

Luca Moschella et al. “Relative representations enable zero-shot latent space com- munication”. In: The Eleventh International Conference on Learning Representa- tions. 2022

work page 2022
[56]

ASIF: Coupled Data Turns Unimodal Models to Multi- modal Without Training

Antonio Norelli et al. “ASIF: Coupled Data Turns Unimodal Models to Multi- modal Without Training”. In: CoRR abs/2210.01738 (2022). DOI: 10 . 48550 / ARXIV.2210.01738. arXiv: 2210.01738

work page arXiv 2022
[57]

CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets

Zachary Novack et al. “CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets”. In: International Conference on Machine Learning, ICML 2023, 23- 29 July 2023, Honolulu, Hawaii, USA . Ed. by Andreas Krause et al. V ol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 26342–26362

work page 2023
[58]

Grounded Situation Recognition

Sarah Pratt et al. “Grounded Situation Recognition”. In: Computer Vision – ECCV

work page
[59]

by Andrea Vedaldi et al

Ed. by Andrea Vedaldi et al. Lecture Notes in Computer Science. Springer. Cham: Springer International Publishing, 2020, pp. 314–332. ISBN : 978-3-030- 58548-8. DOI: 10.1007/978-3-030-58548-8\_19

work page doi:10.1007/978-3-030-58548-8 2020
[60]

Recognition using visual phrases

Mohammad Amin Sadeghi and Ali Farhadi. “Recognition using visual phrases”. In: Cvpr 2011. Ieee. 2011, pp. 1745–1752

work page 2011
[61]

Social Profiling through Image Understanding: Personality Inference Using Convolutional Neural Net- works

Cristina Segalin, Dong Seon Cheng, and Marco Cristani. “Social Profiling through Image Understanding: Personality Inference Using Convolutional Neural Net- works”. In: Computer Vision and Image Understanding. Image and Video Under- standing in Big Data 156 (Mar. 2017), pp. 34–50. ISSN : 1077-3142. DOI: 10 . 1016/j.cviu.2016.10.013

work page 2017
[62]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015

work page 2015
[63]

Conceptnet 5.5: An open multi- lingual graph of general knowledge

Robyn Speer, Joshua Chin, and Catherine Havasi. “Conceptnet 5.5: An open multi- lingual graph of general knowledge”. In:Thirty-first AAAI Conference on Artificial Intelligence. 2017. 20 Martinez Pandiani et al. 2024 (Preprint) /

work page 2017
[64]

Mixture-Kernel Graph Attention Network for Situation Recognition

Mohammed Suhail and Leonid Sigal. “Mixture-Kernel Graph Attention Network for Situation Recognition”. In:2019 IEEE/CVF International Conference on Com- puter Vision (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 10362–10371. ISBN : 978-1-72814-803-8. DOI: 10.1109/ICCV.2019.01046

work page doi:10.1109/iccv.2019.01046 2019
[65]

A Domain Based Approach to Social Relation Recognition

Qianru Sun, Bernt Schiele, and Mario Fritz. “A Domain Based Approach to Social Relation Recognition”. In:2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE, July 2017, pp. 435–444. ISBN : 978-1- 5386-0457-1. DOI: 10.1109/CVPR.2017.54. (Visited on 01/19/2022)

work page doi:10.1109/cvpr.2017.54 2017
[66]

Computer vision: algorithms and applications

Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022

work page 2022
[67]

Knowledge graphs as tools for explainable machine learning: A survey

Ilaria Tiddi and Stefan Schlobach. “Knowledge graphs as tools for explainable machine learning: A survey”. In: Artificial Intelligence 302 (2022), p. 103627

work page 2022
[68]

Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions

Antoine Toisoul et al. “Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions”. In:Nature Machine Intelligence3.1 (Jan. 2021), pp. 42–50. ISSN : 2522-5839. DOI: 10.1038/s42256-020-00280-0

work page doi:10.1038/s42256-020-00280-0 2021
[69]

The representation of abstract words: What matters? Reply to Paivio’s (2013) comment on Kousta et al.(2011)

Gabriella Vigliocco et al. “The representation of abstract words: What matters? Reply to Paivio’s (2013) comment on Kousta et al.(2011).” In: (2013)

work page 2013
[70]

Knowledge graph embedding: A survey of approaches and applications

Quan Wang et al. “Knowledge graph embedding: A survey of approaches and applications”. In: IEEE Transactions on Knowledge and Data Engineering 29.12 (2017), pp. 2724–2743

work page 2017
[71]

Understanding and Map- ping Natural Beauty

Scott Workman, Richard Souvenir, and Nathan Jacobs. “Understanding and Map- ping Natural Beauty”. In: 2017 IEEE International Conference on Computer Vi- sion (ICCV). Venice: IEEE, Oct. 2017, pp. 5590–5599.ISBN : 978-1-5386-1032-9. DOI: 10.1109/ICCV.2017.596

work page doi:10.1109/iccv.2017.596 2017
[72]

Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval

Xingxu Yao et al. “Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval”. In: 2019 IEEE/CVF International Conference on Computer Vi- sion (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 1140–1150. ISBN : 978- 1-72814-803-8. DOI: 10.1109/ICCV.2019.00123

work page doi:10.1109/iccv.2019.00123 2019
[73]

Situation Recognition: Visual Semantic Role Labeling for Image Understanding

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. “Situation Recognition: Visual Semantic Role Labeling for Image Understanding”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Las Vegas, NV , USA: IEEE, June 2016, pp. 5534–5542. ISBN : 978-1-4673-8851-1. DOI: 10 . 1109 / CVPR . 2016.597

work page 2016
[74]

In: Conference on Robot Learning, pp

K. Ye and A. Kovashka. “ADVISE: Symbolism and External Knowledge for De- coding Advertisements”. In: Computer Vision – ECCV 2018 . Ed. by Vittorio Ferrari et al. V ol. 11219 LNCS. Cham: Springer International Publishing, 2018, pp. 868–886. ISBN : 9783030012663. DOI: 10.1007/978-3-030-01267-0\_51

work page doi:10.1007/978-3-030-01267-0 2018
[75]

Interpreting the Rhetoric of Visual Advertisements

Keren Ye et al. “Interpreting the Rhetoric of Visual Advertisements”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 43.4 (Apr. 2019), pp. 1308–1323. ISSN : 1939-3539. DOI: 10.1109/TPAMI.2019.2947440

work page doi:10.1109/tpami.2019.2947440 2019
[76]

Scaling Vision Transformers

Xiaohua Zhai et al. “Scaling Vision Transformers”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022 . IEEE, 2022, pp. 1204–1213. DOI: 10 . 1109 / CVPR52688 . 2022.01179

work page arXiv 2022
[77]

Reasoning about object affordances in a knowledge base representation

Yuke Zhu, Alireza Fathi, and Li Fei-Fei. “Reasoning about object affordances in a knowledge base representation”. In: European conference on computer vision . Springer. 2014, pp. 408–424

work page 2014

[1] [1]

Derf: Decomposed radiance fields,

Panos Achlioptas et al. “ArtEmis: Affective Language for Visual Art”. en. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, vir- tual, June 19-25, 2021. Nashville, TN, USA: Computer Vision Foundation / IEEE, 2021, pp. 11569–11579. DOI: 10.1109/CVPR46437.2021.01140 . (Visited on 02/09/2022)

work page doi:10.1109/cvpr46437.2021.01140 2021

[2] [2]

Explicit reasoning over end-to- end neural architectures for visual question answering

Somak Aditya, Yezhou Yang, and Chitta Baral. “Explicit reasoning over end-to- end neural architectures for visual question answering”. In: Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 32. 2018

work page 2018

[3] [3]

Integrating knowledge and rea- soning in image understanding

Somak Aditya, Yezhou Yang, and Chitta Baral. “Integrating knowledge and rea- soning in image understanding”. In: 28th International Joint Conference on Artifi- cial Intelligence, IJCAI 2019. International Joint Conferences on Artificial Intelli- gence. 2019, pp. 6252–6259

work page 2019

[4] [4]

A public domain dataset for human activity recognition using smartphones

Davide Anguita et al. “A public domain dataset for human activity recognition using smartphones.” In: Esann. V ol. 3. 2013, p. 3

work page 2013

[5] [5]

Distant Viewing Toolkit: A Python Package for the Analysis of Visual Culture

Taylor Arnold and Lauren Tilton. “Distant Viewing Toolkit: A Python Package for the Analysis of Visual Culture”. In: Journal of Open Source Software 5.45 (Jan. 2020), p. 1800. ISSN : 2475-9066. DOI: 10.21105/joss.01800. (Visited on 12/13/2021)

work page doi:10.21105/joss.01800 2020

[6] [6]

Modular Design Patterns for Hybrid Learning and Reasoning Systems: a taxonomy, patterns and use cases

Michael van Bekkum et al. “Modular Design Patterns for Hybrid Learning and Reasoning Systems: a taxonomy, patterns and use cases”. In: arXiv:2102.11965 [cs] 51.9 (Mar. 2021), pp. 6528–6546. (Visited on 01/20/2022)

work page arXiv 2021

[7] [7]

A Survey on Word Meta-Embedding Learning

Danushka Bollegala and James O’Neill. “A Survey on Word Meta-Embedding Learning”. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022 . Ed. by Luc De Raedt. ijcai.org, 2022, pp. 5402–5409. DOI: 10.24963/IJCAI.2022/758

work page doi:10.24963/ijcai.2022/758 2022

[8] [8]

Translating embeddings for modeling multi-relational data

Antoine Bordes et al. “Translating embeddings for modeling multi-relational data”. In: Advances in neural information processing systems 26 (2013). 16 Martinez Pandiani et al. 2024 (Preprint) /

work page 2013

[9] [9]

Negative results in computer vision: A perspective

Ali Borji. “Negative results in computer vision: A perspective”. In: Image and Vision Computing 69 (2018), pp. 1–8

work page 2018

[10] [10]

Culture and human development: A new look

Jerome Bruner. “Culture and human development: A new look”. In: Human devel- opment 33.6 (1990), pp. 344–355

work page 1990

[11] [11]

Scalable Theory-Driven Regularization of Scene Graph Generation Models

Davide Buffelli and Efthymia Tsamoura. “Scalable Theory-Driven Regularization of Scene Graph Generation Models”. In:Thirty-Seventh AAAI Conference on Arti- ficial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Ad- vances in Artificial Intelligence, EAAI...

work page doi:10.1609/aaai.v37i6.25839 2023

[12] [12]

End-to-end object detection with transformers

Nicolas Carion et al. “End-to-end object detection with transformers”. In: Euro- pean conference on computer vision. Springer. 2020, pp. 213–229

work page 2020

[13] [13]

Iterative visual reasoning beyond convolutions

Xinlei Chen et al. “Iterative visual reasoning beyond convolutions”. In: Proc. of CVPR 2018. IEEE. 2018, pp. 7239–7248

work page 2018

[14] [14]

Computers in Human Behavior68, 83–95 (2017) https://doi.org/10.1016/j.chb

F. Ciroku et al. “Automated multimodal sensemaking: Ontology-based integra- tion of linguistic frames and visual data”. In: Computers in Human Behavior 150 (2024), p. 107997. ISSN : 0747-5632. DOI: https://doi.org/10.1016/j.chb. 2023.107997

work page doi:10.1016/j.chb 2024

[15] [15]

The different frameworks underlying abstract and concrete knowledge: Evidence from a bilin- gual patient with a semantic refractory access dysphasia

Sebastian J Crutch, Basil H Ridha, and Elizabeth K Warrington. “The different frameworks underlying abstract and concrete knowledge: Evidence from a bilin- gual patient with a semantic refractory access dysphasia”. In: Neurocase 12.3 (2006), pp. 151–163

work page 2006

[16] [16]

Applying fuzzy DLs in the extraction of image semantics

Stamatia Dasiopoulou, Ioannis Kompatsiaris, and Michael G Strintzis. “Applying fuzzy DLs in the extraction of image semantics”. In: Journal on data semantics XIV. Springer, 2009, pp. 105–132

work page 2009

[17] [17]

Qualitative differences in the representation of ab- stract versus concrete words: Evidence from the visual-world paradigm

Jon Andoni Du ˜nabeitia et al. “Qualitative differences in the representation of ab- stract versus concrete words: Evidence from the visual-world paradigm”. In:Cog- nition 110.2 (2009), pp. 284–292

work page 2009

[18] [18]

Multimodal learning with graphs

Yasha Ektefaie et al. “Multimodal learning with graphs”. In: Nat. Mac. Intell. 5.4 (2023), pp. 340–350. DOI: 10.1038/S42256-023-00624-6

work page doi:10.1038/s42256-023-00624-6 2023

[19] [19]

Cognition does not affect perception: Evalu- ating the evidence for “top-down

Chaz Firestone and Brian J Scholl. “Cognition does not affect perception: Evalu- ating the evidence for “top-down” effects”. In: Behavioral and brain sciences 39 (2016)

work page 2016

[20] [20]

N., Abdrasheva, G

Aldo Gangemi et al. “Framester: A wide coverage linguistic linked data hub”. en. In: European Knowledge Acquisition Workshop. Ed. by Eva Blomqvist et al. Lec- ture Notes in Computer Science. Springer. Cham: Springer International Publish- ing, 2016, pp. 239–254. ISBN : 978-3-319-49004-5. DOI: 10.1007/978-3-319- 49004-5\_16

work page doi:10.1007/978-3-319- 2016

[21] [21]

An End-To-End Network for Gen- erating Social Relationship Graphs

Arushi Goel, Keng Teck Ma, and Cheston Tan. “An End-To-End Network for Gen- erating Social Relationship Graphs”. In:2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, June 2019, pp. 11178–11187. ISBN : 978-1-72813-293-8. DOI: 10.1109/CVPR.2019.01144. Martinez Pandiani et al. 2024 (Preprint) / 17

work page doi:10.1109/cvpr.2019.01144 2019

[22] [22]

In: Proc

Douglas Gray et al. “Predicting Facial Beauty without Landmarks”. In: Com- puter Vision – ECCV 2010. Ed. by Kostas Daniilidis, Petros Maragos, and Nikos Paragios. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2010, pp. 434–447. ISBN : 978-3-642-15567-3. DOI: 10.1007/978- 3- 642- 15567- 3\_32

work page doi:10.1007/978- 2010

[23] [23]

Detecting Persuasive Atypi- cality by Modeling Contextual Compatibility

Meiqi Guo, Rebecca Hwa, and Adriana Kovashka. “Detecting Persuasive Atypi- cality by Modeling Contextual Compatibility”. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, Oct. 2021, pp. 952–962. ISBN : 978-1-66542-812-5. DOI: 10 . 1109 / ICCV48922 . 2021 . 00101. (Visited on 03/03/2022)

work page 2021

[24] [24]

Deep multimodal represen- tation learning: A survey

Wenzhong Guo, Jianwen Wang, and Shiping Wang. “Deep multimodal represen- tation learning: A survey”. In: IEEE Access 7 (2019), pp. 63373–63394

work page 2019

[25] [25]

ConceptNet 3: a flexible, mul- tilingual semantic network for common sense knowledge

Catherine Havasi, Robert Speer, and Jason Alonso. “ConceptNet 3: a flexible, mul- tilingual semantic network for common sense knowledge”. In:Recent advances in natural language processing. John Benjamins Philadelphia, PA. 2007, pp. 27–29

work page 2007

[26] [26]

Deep residual learning for image recognition,

Kaiming He et al. “Deep Residual Learning for Image Recognition”. en. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Las Ve- gas, NV , USA: IEEE, June 2016, pp. 770–778. ISBN : 978-1-4673-8851-1. DOI: 10.1109/CVPR.2016.90. (Visited on 02/15/2022)

work page doi:10.1109/cvpr.2016.90 2016

[27] [27]

Concepts, control, and context: A connectionist account of normal and disordered semantic cognition

Paul Hoffman. “Concepts, control, and context: A connectionist account of normal and disordered semantic cognition.” en. In: Psychological Review 125.3 (2018), p. 293. ISSN : 1939-1471. DOI: 10.1037/rev0000094. (Visited on 12/13/2021)

work page doi:10.1037/rev0000094 2018

[28] [28]

Putting objects in perspec- tive

Derek Hoiem, Alexei A Efros, and Martial Hebert. “Putting objects in perspec- tive”. In: International Journal of Computer Vision 80 (2008), pp. 3–15

work page 2008

[29] [29]

Inferring Visual Persuasion via Body Language, Set- ting, and Deep Features

X. Huang and A. Kovashka. “Inferring Visual Persuasion via Body Language, Set- ting, and Deep Features”. In:IEEE Computer Society Conference on Computer Vi- sion and Pattern Recognition Workshops. 2016, pp. 778–784. ISBN : 978-1-4673- 8850-4. DOI: 10.1109/CVPRW.2016.102

work page doi:10.1109/cvprw.2016.102 2016

[30] [30]

Automatic Understanding of Image and Video Advertise- ments

Zaeem Hussain et al. “Automatic Understanding of Image and Video Advertise- ments”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1705–1715

work page 2017

[31] [31]

Automatic understanding of image and video advertise- ments

Zaeem Hussain et al. “Automatic understanding of image and video advertise- ments”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1705–1715. (Visited on 01/18/2022)

work page 2017

[32] [32]

Discovering states and trans- formations in image collections

Phillip Isola, Joseph J Lim, and Edward H Adelson. “Discovering states and trans- formations in image collections”. In:Proceedings of the IEEE conference on com- puter vision and pattern recognition. 2015, pp. 1383–1391

work page 2015

[33] [33]

A Review on Methods and Applications in Multimodal Deep Learning

Summaira Jabeen et al. “A Review on Methods and Applications in Multimodal Deep Learning”. In: ACM Trans. Multim. Comput. Commun. Appl. 19.2s (2023), 76:1–76:41. DOI: 10.1145/3545572

work page doi:10.1145/3545572 2023

[34] [34]

Derf: Decomposed radiance fields,

Menglin Jia et al. “Intentonomy: a Dataset and Study towards Human Intent Un- derstanding”. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, June 2021, pp. 12981–12991. ISBN : 978-1-66544-509-2. DOI: 10.1109/CVPR46437.2021.01279. (Visited on 02/28/2022). 18 Martinez Pandiani et al. 2024 (Preprint) /

work page doi:10.1109/cvpr46437.2021.01279 2021

[35] [35]

Visual Persuasion: Inferring Communicative Intents of Im- ages

Jungseock Joo et al. “Visual Persuasion: Inferring Communicative Intents of Im- ages”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 216–223. (Visited on 01/18/2022)

work page 2014

[36] [36]

Symbolic image detection using scene and knowledge graphs

Nasrin Kalanat and Adriana Kovashka. “Symbolic image detection using scene and knowledge graphs”. In: arXiv preprint arXiv:2206.04863 (2022)

work page arXiv 2022

[37] [37]

Fairface: Face attribute dataset for bal- anced race, gender, and age for bias measurement and mitigation

Kimmo Karkkainen and Jungseock Joo. “Fairface: Face attribute dataset for bal- anced race, gender, and age for bias measurement and mitigation”. In: Proceed- ings of the IEEE/CVF winter conference on applications of computer vision. 2021, pp. 1548–1558

work page 2021

[38] [38]

The representation of abstract words: Why emo- tion matters

Stavroula-Thaleia Kousta et al. “The representation of abstract words: Why emo- tion matters”. In: Journal of Experimental Psychology: General 140.1 (2011), pp. 14–34. ISSN : 1939-2222. DOI: 10.1037/a0021446

work page doi:10.1037/a0021446 2011

[39] [39]

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Ranjay Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. In: arXiv:1602.07332 [cs] 123.1 (Feb. 2016), pp. 32–73. (Visited on 12/14/2021)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[40] [40]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li et al. “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation”. In: International Conference on Machine Learning. PMLR. 2022, pp. 12888–12900

work page 2022

[41] [41]

Dual-Glance Model for Deciphering Social Relationships

Junnan Li et al. “Dual-Glance Model for Deciphering Social Relationships”. In: 2017 IEEE International Conference on Computer Vision (ICCV) . Venice: IEEE, Oct. 2017, pp. 2669–2678. ISBN : 978-1-5386-1032-9. DOI: 10 . 1109 / ICCV . 2017.289

work page 2017

[42] [42]

Situation Recognition with Graph Neural Networks

Ruiyu Li et al. “Situation Recognition with Graph Neural Networks”. In: 2017 IEEE International Conference on Computer Vision (ICCV) . Venice: IEEE, Oct. 2017, pp. 4183–4192. ISBN : 978-1-5386-1032-9. DOI: 10 . 1109 / ICCV . 2017 . 448

work page 2017

[43] [43]

Graph-Based Social Relation Reasoning

Wanhua Li et al. “Graph-Based Social Relation Reasoning”. In: Computer Vi- sion – ECCV 2020. Ed. by Andrea Vedaldi et al. Lecture Notes in Computer Sci- ence. Cham: Springer International Publishing, 2020, pp. 18–34.ISBN : 978-3-030- 58555-6. DOI: 10.1007/978-3-030-58555-6\_2

work page doi:10.1007/978-3-030-58555-6 2020

[44] [44]

GraphAdapter: Tuning Vision-Language Models With Dual Knowl- edge Graph

Xin Li et al. “GraphAdapter: Tuning Vision-Language Models With Dual Knowl- edge Graph”. In: CoRR abs/2309.13625 (2023). DOI: 10.48550/ARXIV.2309. 13625. arXiv: 2309.13625

work page doi:10.48550/arxiv.2309 2023

[45] [45]

The artbench dataset: Benchmarking generative models with artworks

Peiyuan Liao et al. “The artbench dataset: Benchmarking generative models with artworks”. In: arXiv preprint arXiv:2206.11404 (2022)

work page arXiv 2022

[46] [46]

Microsoft coco: Common objects in context

Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014, pp. 740–755

work page 2014

[47] [47]

ConceptNet–a practical commonsense reasoning tool- kit

Hugo Liu and Push Singh. “ConceptNet–a practical commonsense reasoning tool- kit”. In: BT technology journal 22.4 (2004), pp. 211–226

work page 2004

[48] [48]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

Ziwei Liu et al. “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations”. In:Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 1096–1104

work page 2016

[49] [49]

Collective activity detection using hinge-loss Markov random fields

Ben London et al. “Collective activity detection using hinge-loss Markov random fields”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013, pp. 566–571. Martinez Pandiani et al. 2024 (Preprint) / 19

work page 2013

[50] [50]

The More You Know: Using Knowledge Graphs for Image Classification

Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. “The More You Know: Using Knowledge Graphs for Image Classification”. In: 2017 IEEE Con- ference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp. 20–28.DOI: 10.1109/ CVPR.2017.10

work page 2017

[51] [51]

Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames

D. S. Martinez Pandiani and V . Presutti. “Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames”. In: Proceedings of the Workshops and Tutorials held at LDK 2021 co-located with the 3rd Language, Data and Knowledge Conference (LDK 2021). Zaragoza, Spain, 2021, arXiv–2110

work page 2021

[52] [52]

Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

D. S. Martinez Pandiani and V . Presutti. “Seeing the Intangible: Survey of Im- age Classification into High-Level and Abstract Categories”. In: arXiv preprint arXiv:2308.10562 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

Situated Ground Truths: Enhancing Bias- Aware AI by Situating Data Labels with SituAnnotate

D.S. Martinez Pandiani and V . Presutti. “Situated Ground Truths: Enhancing Bias- Aware AI by Situating Data Labels with SituAnnotate”. In: [Under Review] Spe- cial Issue on Trustworthy Artificial Intelligence of ACM Transactions on Knowl- edge Discovery from Data (TKDD) (2024)

work page 2024

[54] [54]

Hypericons for Interpretability: Decoding Abstract Concepts in Visual Data

D.S. Martinez Pandiani et al. “Hypericons for Interpretability: Decoding Abstract Concepts in Visual Data”. In: International Journal of Digital Humanities (IJDH) (2023)

work page 2023

[55] [55]

Relative representations enable zero-shot latent space com- munication

Luca Moschella et al. “Relative representations enable zero-shot latent space com- munication”. In: The Eleventh International Conference on Learning Representa- tions. 2022

work page 2022

[56] [56]

ASIF: Coupled Data Turns Unimodal Models to Multi- modal Without Training

Antonio Norelli et al. “ASIF: Coupled Data Turns Unimodal Models to Multi- modal Without Training”. In: CoRR abs/2210.01738 (2022). DOI: 10 . 48550 / ARXIV.2210.01738. arXiv: 2210.01738

work page arXiv 2022

[57] [57]

CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets

Zachary Novack et al. “CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets”. In: International Conference on Machine Learning, ICML 2023, 23- 29 July 2023, Honolulu, Hawaii, USA . Ed. by Andreas Krause et al. V ol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 26342–26362

work page 2023

[58] [58]

Grounded Situation Recognition

Sarah Pratt et al. “Grounded Situation Recognition”. In: Computer Vision – ECCV

work page

[59] [59]

by Andrea Vedaldi et al

Ed. by Andrea Vedaldi et al. Lecture Notes in Computer Science. Springer. Cham: Springer International Publishing, 2020, pp. 314–332. ISBN : 978-3-030- 58548-8. DOI: 10.1007/978-3-030-58548-8\_19

work page doi:10.1007/978-3-030-58548-8 2020

[60] [60]

Recognition using visual phrases

Mohammad Amin Sadeghi and Ali Farhadi. “Recognition using visual phrases”. In: Cvpr 2011. Ieee. 2011, pp. 1745–1752

work page 2011

[61] [61]

Social Profiling through Image Understanding: Personality Inference Using Convolutional Neural Net- works

Cristina Segalin, Dong Seon Cheng, and Marco Cristani. “Social Profiling through Image Understanding: Personality Inference Using Convolutional Neural Net- works”. In: Computer Vision and Image Understanding. Image and Video Under- standing in Big Data 156 (Mar. 2017), pp. 34–50. ISSN : 1077-3142. DOI: 10 . 1016/j.cviu.2016.10.013

work page 2017

[62] [62]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015

work page 2015

[63] [63]

Conceptnet 5.5: An open multi- lingual graph of general knowledge

Robyn Speer, Joshua Chin, and Catherine Havasi. “Conceptnet 5.5: An open multi- lingual graph of general knowledge”. In:Thirty-first AAAI Conference on Artificial Intelligence. 2017. 20 Martinez Pandiani et al. 2024 (Preprint) /

work page 2017

[64] [64]

Mixture-Kernel Graph Attention Network for Situation Recognition

Mohammed Suhail and Leonid Sigal. “Mixture-Kernel Graph Attention Network for Situation Recognition”. In:2019 IEEE/CVF International Conference on Com- puter Vision (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 10362–10371. ISBN : 978-1-72814-803-8. DOI: 10.1109/ICCV.2019.01046

work page doi:10.1109/iccv.2019.01046 2019

[65] [65]

A Domain Based Approach to Social Relation Recognition

Qianru Sun, Bernt Schiele, and Mario Fritz. “A Domain Based Approach to Social Relation Recognition”. In:2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE, July 2017, pp. 435–444. ISBN : 978-1- 5386-0457-1. DOI: 10.1109/CVPR.2017.54. (Visited on 01/19/2022)

work page doi:10.1109/cvpr.2017.54 2017

[66] [66]

Computer vision: algorithms and applications

Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022

work page 2022

[67] [67]

Knowledge graphs as tools for explainable machine learning: A survey

Ilaria Tiddi and Stefan Schlobach. “Knowledge graphs as tools for explainable machine learning: A survey”. In: Artificial Intelligence 302 (2022), p. 103627

work page 2022

[68] [68]

Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions

Antoine Toisoul et al. “Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions”. In:Nature Machine Intelligence3.1 (Jan. 2021), pp. 42–50. ISSN : 2522-5839. DOI: 10.1038/s42256-020-00280-0

work page doi:10.1038/s42256-020-00280-0 2021

[69] [69]

The representation of abstract words: What matters? Reply to Paivio’s (2013) comment on Kousta et al.(2011)

Gabriella Vigliocco et al. “The representation of abstract words: What matters? Reply to Paivio’s (2013) comment on Kousta et al.(2011).” In: (2013)

work page 2013

[70] [70]

Knowledge graph embedding: A survey of approaches and applications

Quan Wang et al. “Knowledge graph embedding: A survey of approaches and applications”. In: IEEE Transactions on Knowledge and Data Engineering 29.12 (2017), pp. 2724–2743

work page 2017

[71] [71]

Understanding and Map- ping Natural Beauty

Scott Workman, Richard Souvenir, and Nathan Jacobs. “Understanding and Map- ping Natural Beauty”. In: 2017 IEEE International Conference on Computer Vi- sion (ICCV). Venice: IEEE, Oct. 2017, pp. 5590–5599.ISBN : 978-1-5386-1032-9. DOI: 10.1109/ICCV.2017.596

work page doi:10.1109/iccv.2017.596 2017

[72] [72]

Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval

Xingxu Yao et al. “Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval”. In: 2019 IEEE/CVF International Conference on Computer Vi- sion (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 1140–1150. ISBN : 978- 1-72814-803-8. DOI: 10.1109/ICCV.2019.00123

work page doi:10.1109/iccv.2019.00123 2019

[73] [73]

Situation Recognition: Visual Semantic Role Labeling for Image Understanding

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. “Situation Recognition: Visual Semantic Role Labeling for Image Understanding”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Las Vegas, NV , USA: IEEE, June 2016, pp. 5534–5542. ISBN : 978-1-4673-8851-1. DOI: 10 . 1109 / CVPR . 2016.597

work page 2016

[74] [74]

In: Conference on Robot Learning, pp

K. Ye and A. Kovashka. “ADVISE: Symbolism and External Knowledge for De- coding Advertisements”. In: Computer Vision – ECCV 2018 . Ed. by Vittorio Ferrari et al. V ol. 11219 LNCS. Cham: Springer International Publishing, 2018, pp. 868–886. ISBN : 9783030012663. DOI: 10.1007/978-3-030-01267-0\_51

work page doi:10.1007/978-3-030-01267-0 2018

[75] [75]

Interpreting the Rhetoric of Visual Advertisements

Keren Ye et al. “Interpreting the Rhetoric of Visual Advertisements”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 43.4 (Apr. 2019), pp. 1308–1323. ISSN : 1939-3539. DOI: 10.1109/TPAMI.2019.2947440

work page doi:10.1109/tpami.2019.2947440 2019

[76] [76]

Scaling Vision Transformers

Xiaohua Zhai et al. “Scaling Vision Transformers”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022 . IEEE, 2022, pp. 1204–1213. DOI: 10 . 1109 / CVPR52688 . 2022.01179

work page arXiv 2022

[77] [77]

Reasoning about object affordances in a knowledge base representation

Yuke Zhu, Alireza Fathi, and Li Fei-Fei. “Reasoning about object affordances in a knowledge base representation”. In: European conference on computer vision . Springer. 2014, pp. 408–424

work page 2014