
arxiv: 2402.19339 · v1 · submitted 2024-02-29 · 💻 cs.CV · cs.AI

Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification

Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords abstract concept classification · knowledge graph embeddings · vision transformers · neuro-symbolic methods · situated perceptual knowledge · image understanding · hybrid models · interpretability

The pith

Fusing knowledge graph embeddings of situated perceptual knowledge with Vision Transformer features improves abstract concept image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that automatically extracting perceptual semantic units from cultural images, modeling them in the ARTstract Knowledge Graph, and fusing the resulting embeddings with visual transformer features produces better performance than existing methods on abstract concept classification. A sympathetic reader would care because standard deep vision models excel at low-level pixel patterns but often miss the context-dependent, semantic understanding humans bring to high-level image interpretation. The work shows complementarity: vision transformers handle sensory attributes while the knowledge graph component represents more abstract scene elements. This hybrid approach is presented as evidence that neuro-symbolic integration can address gaps in current visual comprehension systems for downstream tasks.

Core claim

Hybrid KGE-ViT methods combine embeddings from the ARTstract Knowledge Graph (built from over 14,000 labeled cultural images and enriched with linguistic frames) with Vision Transformer embeddings, and they outperform existing techniques on abstract concept image classification. Post hoc analyses indicate that the visual transformer captures pixel-level attributes while the fused method better represents abstract and semantic scene elements. The paper reads this as synergy between the situated perceptual knowledge in the KGE and the sensory-perceptual understanding in the deep model.

What carries the argument

The ARTstract Knowledge Graph (AKG) that encodes automatically extracted perceptual semantic units and high-level linguistic frames, whose embeddings are fused with Vision Transformer embeddings via relative representations and hybrid approaches.
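The fusion step can be pictured with a minimal sketch. It assumes that a relative representation re-expresses each embedding as its cosine similarities to a shared set of anchor samples (the idea behind reference [55]) and that the hybrid is a plain concatenation of the two relative vectors. The paper's exact fusion operators, anchor selection, and embedding sizes are not stated on this page, so the function names, the dimensions 768 and 128, and the anchor indices below are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def relative_representation(samples: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """Re-express each sample as its cosine similarity to every anchor."""
    return l2_normalize(samples) @ l2_normalize(anchors).T


def hybrid_embedding(vit, kge, vit_anchors, kge_anchors):
    """Concatenate relative ViT and relative KGE vectors (assumed fusion rule)."""
    rel_vit = relative_representation(vit, vit_anchors)
    rel_kge = relative_representation(kge, kge_anchors)
    return np.concatenate([rel_vit, rel_kge], axis=1)


# Toy data standing in for real embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
vit = rng.normal(size=(5, 768))   # one ViT feature vector per image
kge = rng.normal(size=(5, 128))   # one KG embedding per image node
anchor_idx = [0, 2, 4]            # the same images act as anchors in both spaces

hybrid = hybrid_embedding(vit, kge, vit[anchor_idx], kge[anchor_idx])
print(hybrid.shape)  # (5, 6): three ViT similarities plus three KGE similarities
```

Because both spaces are projected onto the same anchors, the two halves of the hybrid vector are directly comparable even though the original ViT and KGE dimensions differ.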

If this is right

  • The hybrid methods achieve higher accuracy than existing techniques specifically on abstract concept image classification tasks.
  • Posthoc interpretability shows the visual transformer focuses on pixel-level visual attributes while the KGE component handles more abstract and semantic scene elements.
  • The demonstrated synergy between KGE embeddings and ViT features supports the use of neuro-symbolic methods for knowledge integration in visual representation.
  • The approach suggests potential for improved performance on downstream intricate visual comprehension tasks that require both sensory and situated knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the AKG extraction process generalizes beyond the cultural image dataset, the same fusion technique could be applied to other high-level vision domains such as scene understanding in video or medical imaging.
  • The complementarity finding implies that pure scaling of vision transformers may hit limits on tasks requiring explicit semantic context, pointing toward systematic testing of KG fusion on distribution-shift benchmarks.
  • One testable extension is whether the relative representation method used for fusion reduces the need for large amounts of labeled data compared to end-to-end fine-tuning alone.

Load-bearing premise

The automatically extracted perceptual semantic units and resulting ARTstract Knowledge Graph accurately encode situated, context-dependent human knowledge of abstract concepts such that embedding and fusing them with ViT features yields genuine generalization rather than dataset-specific fitting.

What would settle it

A test in which the knowledge graph embeddings are replaced by embeddings from a randomly constructed graph with the same structure; if the hybrid method no longer outperforms the pure ViT baseline on the same test set, the contribution of the situated perceptual knowledge would be falsified.
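A cheaper stand-in for this control can be sketched as follows. Instead of rebuilding a random graph, it permutes the existing KG embeddings across images, which keeps their overall distribution but breaks the link between an image and its own knowledge. This substitution, the function names, and the decision rule are assumptions made for illustration, and the scores in the usage lines are placeholders rather than results from the paper.

```python
import numpy as np


def shuffled_control(kge: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return the KGE matrix with its rows permuted across images."""
    rng = np.random.default_rng(seed)
    return kge[rng.permutation(len(kge))]


def knowledge_contributes(f1_vit, f1_hybrid_real, f1_hybrid_control, margin=0.0):
    """True if the real hybrid beats both the ViT baseline and every control run.

    f1_hybrid_control is a list of scores, one per control seed.
    """
    beats_vit = (f1_hybrid_real - f1_vit) > margin
    beats_control = (f1_hybrid_real - max(f1_hybrid_control)) > margin
    return beats_vit and beats_control


# Placeholder scores, NOT measurements: they only show how the rule is applied.
print(knowledge_contributes(0.30, 0.33, [0.30, 0.31]))  # real KGE adds something
print(knowledge_contributes(0.30, 0.33, [0.33, 0.34]))  # control does as well
```

If the control hybrids match the real hybrid, the gain is more plausibly explained by extra input dimensions than by the content of the knowledge graph.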

Figures

Figures reproduced from arXiv: 2402.19339 by Delfina Sol Martinez Pandiani, Nicolas Lazzari, Valentina Presutti.

Figure 1
Figure 1: Subset of the A-Box of ARTstract-KG, showing the types of commonsense linguistic knowledge connected to a single image instance. Most annotations are typed by ConceptNet concepts, while the image captions are typed by WordNet concepts as well as by linguistic frames.
3.2. ARTstract Knowledge Graph Creation. We use the SituAnnotate ontology [53], which models the situated assignment of annotation labels to …
Figure 2
Figure 2: Macro F1 scores on the AC image classification tasks for different input embeddings.
Absolute versus Relative Embeddings. RelKGE outperformed absKGE, achieving a higher Macro F1 score of 0.27 compared to 0.22 (see …
Figure 3
Figure 3: Absolute ViT vs. Absolute KGE embeddings capture different aspects of ARTstract images. Top: Absolute ViT captures aspects that resemble the United States flag while KGE captures more landscape-related features. Bottom: Absolute KGE demonstrates superior semantic performance than ViT by encoding similarities with perceptually diverse visions of the Statue of Liberty.
Figure 4
Figure 4: Contrasting semantic proficiency of Absolute KGE vs. Absolute ViT. The top image illustrates ViT’s focus on colors and textures (aesthetics), whereas KGE excels in recognizing explicit semantics, particularly women sitting on couches. In the bottom image, KGE effectively encodes the semantics of reading a book in the test artwork. Multiple test instances suggest that the KGE method exhibits superior perfor…
Figure 5
Figure 5: ViT misclassifies as death, but KGE successfully associates images with crosses to the concept of comfort, indicating ViT’s focus on colors and textures.
Figure 6
Figure 6: ViT misclassifies as comfort, but KGE successfully associates images with crosses to the concept of death.
… comfort, likely due to the original image’s warm colors, landscape composition and drawing/cartoon-like drawing features, the top similar images as based on vit feature outdoor scenes irrelevant to the ground truth of death. In contrast, the top three similar images based on KGE embeddings share the c…
Figure 7
Figure 7: Interpretability results for a test image labeled as fitness. Top similar anchors are shown for the test instance using relative ViT embeddings (top row), relative KGE embeddings (middle rows), and hybrid embeddings. Shared ARTstract-KG nodes accompany each row. The hybrid embedding integrates complementary information from both relative embeddings to prioritize anchors tagged as fitness. These findings h…
read the original abstract

The increasing demand for automatic high-level image understanding, particularly in detecting abstract concepts (AC) within images, underscores the necessity for innovative and more interpretable approaches. These approaches need to harmonize traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at intricate semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute KG embeddings and experiment with relative representations and hybrid approaches that fuse these embeddings with visual transformer embeddings. Finally, for interpretability, we conduct posthoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The posthoc interpretability analyses reveal the visual transformer's proficiency in capturing pixel-level visual attributes, contrasting with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between KGE embeddings' situated perceptual knowledge and deep visual model's sensory-perceptual understanding for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation for use in downstream intricate visual comprehension tasks. All the materials and code are available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces a neuro-symbolic approach for classifying abstract concepts (ACs) in cultural images. It automatically extracts perceptual semantic units from over 14,000 labeled images to construct the ARTstract Knowledge Graph (AKG), augments it with linguistic frames, computes KG embeddings, and fuses them with Vision Transformer (ViT) features via relative representations and hybrid models. The central claims are that the hybrid KGE-ViT methods outperform prior techniques on AC classification and that post-hoc analyses demonstrate complementarity, with KGE capturing abstract semantic elements and ViT handling pixel-level attributes.

Significance. If the empirical claims hold after addressing validation gaps, the work would contribute to neuro-symbolic computer vision by showing how situated perceptual knowledge from a domain-specific KG can complement sensory features from transformers for high-level semantic tasks. The public release of materials and code is a positive factor that supports reproducibility.

major comments (3)
  1. [Methods (AKG construction and embedding)] Methods section describing AKG construction: The automatic extraction of perceptual semantic units and their modeling into the ARTstract Knowledge Graph is presented without any human validation, inter-annotator agreement scores, ablation on extraction parameters, or comparison to an independent gold standard. This is load-bearing for the claim that the AKG encodes 'situated perceptual knowledge' rather than extraction artifacts or dataset biases, directly affecting the interpretation of any performance gains from fusion.
  2. [Results and Experiments] Results section: The assertion that hybrid KGE-ViT methods outperform existing techniques is not accompanied by the specific quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies needed to evaluate the central empirical claim. Without these, it is impossible to determine whether observed improvements arise from genuine complementarity or from auxiliary embedding effects.
  3. [Interpretability analyses] Interpretability analyses: The post-hoc qualitative comparison of model similarities with training instances is used to contrast ViT's pixel-level focus with the method's semantic focus, but no quantitative measures (e.g., similarity score distributions or controlled examples) are supplied to substantiate the claimed complementarity.
minor comments (1)
  1. [Abstract] The abstract and introduction use 'ACs' without an initial expansion on first use.
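The statistical evidence asked for in major comment 2 could take the form of a paired bootstrap over test images. The sketch below is one common way to do this, not the authors' procedure: it computes Macro F1 (the metric named in the Figure 2 caption) and a 95% interval for the difference between two models evaluated on the same test set. The function names, the resample count, and the convention that a class with no true or predicted instances scores 0 are assumptions.

```python
import numpy as np


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


def paired_bootstrap(y_true, pred_a, pred_b, n_classes, n_resamples=1000, seed=0):
    """95% percentile interval for macro_f1(A) - macro_f1(B) on shared resamples."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test images with replacement
        diffs[i] = (macro_f1(y_true[idx], pred_a[idx], n_classes)
                    - macro_f1(y_true[idx], pred_b[idx], n_classes))
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(low), float(high)
```

An interval that excludes zero would support the claim that one model is better on this test set. An interval that straddles zero means the observed gap could be resampling noise.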

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve clarity, rigor, and completeness.

read point-by-point responses
  1. Referee: [Methods (AKG construction and embedding)] Methods section describing AKG construction: The automatic extraction of perceptual semantic units and their modeling into the ARTstract Knowledge Graph is presented without any human validation, inter-annotator agreement scores, ablation on extraction parameters, or comparison to an independent gold standard. This is load-bearing for the claim that the AKG encodes 'situated perceptual knowledge' rather than extraction artifacts or dataset biases, directly affecting the interpretation of any performance gains from fusion.

    Authors: We agree that additional validation details would strengthen the presentation. In the revision we will expand the methods section with an explicit description of extraction parameters, an ablation study varying those parameters, and a discussion of potential dataset biases. We will also include a small-scale comparison against a manually reviewed subset of extracted units to provide an independent check, while noting that a full inter-annotator agreement study was outside the original scope. revision: yes

  2. Referee: [Results and Experiments] Results section: The assertion that hybrid KGE-ViT methods outperform existing techniques is not accompanied by the specific quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies needed to evaluate the central empirical claim. Without these, it is impossible to determine whether observed improvements arise from genuine complementarity or from auxiliary embedding effects.

    Authors: The results section reports performance numbers and baseline comparisons, yet we acknowledge that statistical tests and additional ablations would make the evidence more robust. We will revise the section to tabulate all quantitative metrics explicitly, add paired statistical significance tests, and include further ablation experiments isolating the contribution of the KGE component versus embedding dimensionality effects. revision: yes

  3. Referee: [Interpretability analyses] Interpretability analyses: The post-hoc qualitative comparison of model similarities with training instances is used to contrast ViT's pixel-level focus with the method's semantic focus, but no quantitative measures (e.g., similarity score distributions or controlled examples) are supplied to substantiate the claimed complementarity.

    Authors: We will augment the interpretability analyses with quantitative support, specifically by reporting distributions of similarity scores for each model type and by adding controlled example pairs with numerical similarity values to demonstrate the differing focus of KGE versus ViT representations. revision: yes
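One way to turn the qualitative neighbour inspection into a number is to measure how often the nearest training instances share the test image's label in each embedding space. The sketch below is an illustrative assumption about what such a measure could look like. The function names and the choice of cosine similarity are not taken from the paper, and rows are assumed to be non-zero vectors.

```python
import numpy as np


def top_k_neighbors(query, train, k):
    """Indices and cosine similarities of the k most similar training rows per query."""
    query = np.asarray(query, dtype=float)
    train = np.asarray(train, dtype=float)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    t = train / np.linalg.norm(train, axis=1, keepdims=True)
    sims = q @ t.T
    idx = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    return idx, np.take_along_axis(sims, idx, axis=1)


def neighbor_label_agreement(query_labels, train_labels, neighbor_idx):
    """Per query, the fraction of retrieved neighbours that share its label."""
    query_labels = np.asarray(query_labels)
    train_labels = np.asarray(train_labels)
    return (train_labels[neighbor_idx] == query_labels[:, None]).mean(axis=1)


# Tiny hand-made example: two of the three training rows point roughly the same way.
train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_labels = [0, 1, 0]
idx, sims = top_k_neighbors([[1.0, 0.1]], train, k=2)
print(idx, neighbor_label_agreement([0], train_labels, idx))
```

Running the same measure on ViT, KGE, and hybrid embeddings, and reporting the distribution of `sims` alongside it, would give the similarity-score distributions the authors promise in their response.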

Circularity Check

0 steps flagged

No significant circularity; empirical fusion evaluated on held-out data

full rationale

The paper describes an empirical pipeline: automatic extraction of perceptual units from labeled images to construct the ARTstract KG, computation of KG embeddings, fusion with ViT features, and accuracy comparison against baselines. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that would make the hybrid performance result equivalent to its inputs by construction. The central claim rests on standard train/test splits and external resource construction rather than internal re-derivation, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view supplies minimal technical detail; the central claim depends on the unverified quality of the perceptual unit extraction process and the assumption that KG embeddings meaningfully encode situated knowledge.

axioms (1)
  • domain assumption: Automatically extracted perceptual semantic units from images can be modeled to capture situated perceptual semantics of abstract concepts.
    Invoked when constructing the AKG and using its embeddings for classification.
invented entities (1)
  • ARTstract Knowledge Graph (AKG) · no independent evidence
    purpose: Captures situated perceptual semantics from over 14,000 cultural images labeled with abstract concepts, augmented with linguistic frames.
    Newly constructed resource described in the abstract.



Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 2 internal anchors

  1. [1]

    ArtEmis: Affective Language for Visual Art

    Panos Achlioptas et al. “ArtEmis: Affective Language for Visual Art”. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. Nashville, TN, USA: Computer Vision Foundation / IEEE, 2021, pp. 11569–11579. DOI: 10.1109/CVPR46437.2021.01140. (Visited on 02/09/2022)

  2. [2]

    Explicit reasoning over end-to-end neural architectures for visual question answering

    Somak Aditya, Yezhou Yang, and Chitta Baral. “Explicit reasoning over end-to-end neural architectures for visual question answering”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 2018

  3. [3]

    Integrating knowledge and reasoning in image understanding

    Somak Aditya, Yezhou Yang, and Chitta Baral. “Integrating knowledge and reasoning in image understanding”. In: 28th International Joint Conference on Artificial Intelligence, IJCAI 2019. International Joint Conferences on Artificial Intelligence. 2019, pp. 6252–6259

  4. [4]

    A public domain dataset for human activity recognition using smartphones

    Davide Anguita et al. “A public domain dataset for human activity recognition using smartphones”. In: ESANN. Vol. 3. 2013, p. 3

  5. [5]

    Distant Viewing Toolkit: A Python Package for the Analysis of Visual Culture

    Taylor Arnold and Lauren Tilton. “Distant Viewing Toolkit: A Python Package for the Analysis of Visual Culture”. In: Journal of Open Source Software 5.45 (Jan. 2020), p. 1800. ISSN: 2475-9066. DOI: 10.21105/joss.01800. (Visited on 12/13/2021)

  6. [6]

    Modular Design Patterns for Hybrid Learning and Reasoning Systems: a taxonomy, patterns and use cases

    Michael van Bekkum et al. “Modular Design Patterns for Hybrid Learning and Reasoning Systems: a taxonomy, patterns and use cases”. In: arXiv:2102.11965 [cs] 51.9 (Mar. 2021), pp. 6528–6546. (Visited on 01/20/2022)

  7. [7]

    A Survey on Word Meta-Embedding Learning

    Danushka Bollegala and James O’Neill. “A Survey on Word Meta-Embedding Learning”. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022. Ed. by Luc De Raedt. ijcai.org, 2022, pp. 5402–5409. DOI: 10.24963/IJCAI.2022/758

  8. [8]

    Translating embeddings for modeling multi-relational data

    Antoine Bordes et al. “Translating embeddings for modeling multi-relational data”. In: Advances in neural information processing systems 26 (2013).

  9. [9]

    Negative results in computer vision: A perspective

    Ali Borji. “Negative results in computer vision: A perspective”. In: Image and Vision Computing 69 (2018), pp. 1–8

  10. [10]

    Culture and human development: A new look

    Jerome Bruner. “Culture and human development: A new look”. In: Human development 33.6 (1990), pp. 344–355

  11. [11]

    Scalable Theory-Driven Regularization of Scene Graph Generation Models

    Davide Buffelli and Efthymia Tsamoura. “Scalable Theory-Driven Regularization of Scene Graph Generation Models”. In: Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI...

  12. [12]

    End-to-end object detection with transformers

    Nicolas Carion et al. “End-to-end object detection with transformers”. In: European conference on computer vision. Springer. 2020, pp. 213–229

  13. [13]

    Iterative visual reasoning beyond convolutions

    Xinlei Chen et al. “Iterative visual reasoning beyond convolutions”. In: Proc. of CVPR 2018. IEEE. 2018, pp. 7239–7248

  14. [14]

    Automated multimodal sensemaking: Ontology-based integration of linguistic frames and visual data

    F. Ciroku et al. “Automated multimodal sensemaking: Ontology-based integration of linguistic frames and visual data”. In: Computers in Human Behavior 150 (2024), p. 107997. ISSN: 0747-5632. DOI: https://doi.org/10.1016/j.chb.2023.107997

  15. [15]

    The different frameworks underlying abstract and concrete knowledge: Evidence from a bilingual patient with a semantic refractory access dysphasia

    Sebastian J Crutch, Basil H Ridha, and Elizabeth K Warrington. “The different frameworks underlying abstract and concrete knowledge: Evidence from a bilingual patient with a semantic refractory access dysphasia”. In: Neurocase 12.3 (2006), pp. 151–163

  16. [16]

    Applying fuzzy DLs in the extraction of image semantics

    Stamatia Dasiopoulou, Ioannis Kompatsiaris, and Michael G Strintzis. “Applying fuzzy DLs in the extraction of image semantics”. In: Journal on data semantics XIV. Springer, 2009, pp. 105–132

  17. [17]

    Qualitative differences in the representation of abstract versus concrete words: Evidence from the visual-world paradigm

    Jon Andoni Duñabeitia et al. “Qualitative differences in the representation of abstract versus concrete words: Evidence from the visual-world paradigm”. In: Cognition 110.2 (2009), pp. 284–292

  18. [18]

    Multimodal learning with graphs

    Yasha Ektefaie et al. “Multimodal learning with graphs”. In: Nat. Mac. Intell. 5.4 (2023), pp. 340–350. DOI: 10.1038/S42256-023-00624-6

  19. [19]

    Cognition does not affect perception: Evaluating the evidence for “top-down” effects

    Chaz Firestone and Brian J Scholl. “Cognition does not affect perception: Evaluating the evidence for “top-down” effects”. In: Behavioral and brain sciences 39 (2016)

  20. [20]

    Framester: A wide coverage linguistic linked data hub

    Aldo Gangemi et al. “Framester: A wide coverage linguistic linked data hub”. In: European Knowledge Acquisition Workshop. Ed. by Eva Blomqvist et al. Lecture Notes in Computer Science. Springer. Cham: Springer International Publishing, 2016, pp. 239–254. ISBN: 978-3-319-49004-5. DOI: 10.1007/978-3-319-49004-5_16

  21. [21]

    An End-To-End Network for Generating Social Relationship Graphs

    Arushi Goel, Keng Teck Ma, and Cheston Tan. “An End-To-End Network for Generating Social Relationship Graphs”. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, June 2019, pp. 11178–11187. ISBN: 978-1-72813-293-8. DOI: 10.1109/CVPR.2019.01144.

  22. [22]

    Predicting Facial Beauty without Landmarks

    Douglas Gray et al. “Predicting Facial Beauty without Landmarks”. In: Computer Vision – ECCV 2010. Ed. by Kostas Daniilidis, Petros Maragos, and Nikos Paragios. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2010, pp. 434–447. ISBN: 978-3-642-15567-3. DOI: 10.1007/978-3-642-15567-3_32

  23. [23]

    Detecting Persuasive Atypicality by Modeling Contextual Compatibility

    Meiqi Guo, Rebecca Hwa, and Adriana Kovashka. “Detecting Persuasive Atypicality by Modeling Contextual Compatibility”. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE, Oct. 2021, pp. 952–962. ISBN: 978-1-66542-812-5. DOI: 10.1109/ICCV48922.2021.00101. (Visited on 03/03/2022)

  24. [24]

    Deep multimodal representation learning: A survey

    Wenzhong Guo, Jianwen Wang, and Shiping Wang. “Deep multimodal representation learning: A survey”. In: IEEE Access 7 (2019), pp. 63373–63394

  25. [25]

    ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge

    Catherine Havasi, Robert Speer, and Jason Alonso. “ConceptNet 3: a flexible, multilingual semantic network for common sense knowledge”. In: Recent advances in natural language processing. John Benjamins Philadelphia, PA. 2007, pp. 27–29

  26. [26]

    Deep Residual Learning for Image Recognition

    Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, June 2016, pp. 770–778. ISBN: 978-1-4673-8851-1. DOI: 10.1109/CVPR.2016.90. (Visited on 02/15/2022)

  27. [27]

    Concepts, control, and context: A connectionist account of normal and disordered semantic cognition

    Paul Hoffman. “Concepts, control, and context: A connectionist account of normal and disordered semantic cognition”. In: Psychological Review 125.3 (2018), p. 293. ISSN: 1939-1471. DOI: 10.1037/rev0000094. (Visited on 12/13/2021)

  28. [28]

    Putting objects in perspective

    Derek Hoiem, Alexei A Efros, and Martial Hebert. “Putting objects in perspective”. In: International Journal of Computer Vision 80 (2008), pp. 3–15

  29. [29]

    Inferring Visual Persuasion via Body Language, Setting, and Deep Features

    X. Huang and A. Kovashka. “Inferring Visual Persuasion via Body Language, Setting, and Deep Features”. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2016, pp. 778–784. ISBN: 978-1-4673-8850-4. DOI: 10.1109/CVPRW.2016.102

  30. [30]

    Automatic Understanding of Image and Video Advertisements

    Zaeem Hussain et al. “Automatic Understanding of Image and Video Advertisements”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1705–1715

  31. [31]

    Automatic understanding of image and video advertisements

    Zaeem Hussain et al. “Automatic understanding of image and video advertisements”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1705–1715. (Visited on 01/18/2022)

  32. [32]

    Discovering states and transformations in image collections

    Phillip Isola, Joseph J Lim, and Edward H Adelson. “Discovering states and transformations in image collections”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1383–1391

  33. [33]

    A Review on Methods and Applications in Multimodal Deep Learning

    Summaira Jabeen et al. “A Review on Methods and Applications in Multimodal Deep Learning”. In: ACM Trans. Multim. Comput. Commun. Appl. 19.2s (2023), 76:1–76:41. DOI: 10.1145/3545572

  34. [34]

    Intentonomy: a Dataset and Study towards Human Intent Understanding

    Menglin Jia et al. “Intentonomy: a Dataset and Study towards Human Intent Understanding”. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, June 2021, pp. 12981–12991. ISBN: 978-1-66544-509-2. DOI: 10.1109/CVPR46437.2021.01279. (Visited on 02/28/2022).

  35. [35]

    Visual Persuasion: Inferring Communicative Intents of Images

    Jungseock Joo et al. “Visual Persuasion: Inferring Communicative Intents of Images”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 216–223. (Visited on 01/18/2022)

  36. [36]

    Symbolic image detection using scene and knowledge graphs

    Nasrin Kalanat and Adriana Kovashka. “Symbolic image detection using scene and knowledge graphs”. In: arXiv preprint arXiv:2206.04863 (2022)

  37. [37]

    Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation

    Kimmo Karkkainen and Jungseock Joo. “Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation”. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2021, pp. 1548–1558

  38. [38]

    The representation of abstract words: Why emotion matters

    Stavroula-Thaleia Kousta et al. “The representation of abstract words: Why emotion matters”. In: Journal of Experimental Psychology: General 140.1 (2011), pp. 14–34. ISSN: 1939-2222. DOI: 10.1037/a0021446

  39. [39]

    Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

    Ranjay Krishna et al. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations”. In: arXiv:1602.07332 [cs] 123.1 (Feb. 2016), pp. 32–73. (Visited on 12/14/2021)

  40. [40]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li et al. “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation”. In: International Conference on Machine Learning. PMLR. 2022, pp. 12888–12900

  41. [41]

    Dual-Glance Model for Deciphering Social Relationships

    Junnan Li et al. “Dual-Glance Model for Deciphering Social Relationships”. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, Oct. 2017, pp. 2669–2678. ISBN: 978-1-5386-1032-9. DOI: 10.1109/ICCV.2017.289

  42. [42]

    Situation Recognition with Graph Neural Networks

    Ruiyu Li et al. “Situation Recognition with Graph Neural Networks”. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, Oct. 2017, pp. 4183–4192. ISBN: 978-1-5386-1032-9. DOI: 10.1109/ICCV.2017.448

  43. [43]

    Graph-Based Social Relation Reasoning

    Wanhua Li et al. “Graph-Based Social Relation Reasoning”. In: Computer Vision – ECCV 2020. Ed. by Andrea Vedaldi et al. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 18–34. ISBN: 978-3-030-58555-6. DOI: 10.1007/978-3-030-58555-6_2

  44. [44]

    GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

    Xin Li et al. “GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph”. In: CoRR abs/2309.13625 (2023). DOI: 10.48550/ARXIV.2309.13625. arXiv: 2309.13625

  45. [45]

    The artbench dataset: Benchmarking generative models with artworks

    Peiyuan Liao et al. “The artbench dataset: Benchmarking generative models with artworks”. In: arXiv preprint arXiv:2206.11404 (2022)

  46. [46]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014, pp. 740–755

  47. [47]

    ConceptNet–a practical commonsense reasoning tool-kit

    Hugo Liu and Push Singh. “ConceptNet–a practical commonsense reasoning tool-kit”. In: BT technology journal 22.4 (2004), pp. 211–226

  48. [48]

    Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

    Ziwei Liu et al. “Deepfashion: Powering robust clothes recognition and retrieval with rich annotations”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 1096–1104

  49. [49]

    Collective activity detection using hinge-loss Markov random fields

    Ben London et al. “Collective activity detection using hinge-loss Markov random fields”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2013, pp. 566–571.

  50. [50]

    The More You Know: Using Knowledge Graphs for Image Classification

    Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. “The More You Know: Using Knowledge Graphs for Image Classification”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp. 20–28. DOI: 10.1109/CVPR.2017.10

  51. [51]

    Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames

    D. S. Martinez Pandiani and V. Presutti. “Automatic Modeling of Social Concepts Evoked by Art Images as Multimodal Frames”. In: Proceedings of the Workshops and Tutorials held at LDK 2021 co-located with the 3rd Language, Data and Knowledge Conference (LDK 2021). Zaragoza, Spain, 2021, arXiv–2110

  52. [52]

    Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories

    D. S. Martinez Pandiani and V. Presutti. “Seeing the Intangible: Survey of Image Classification into High-Level and Abstract Categories”. In: arXiv preprint arXiv:2308.10562 (2023)

  53. [53]

    Situated Ground Truths: Enhancing Bias-Aware AI by Situating Data Labels with SituAnnotate

    D. S. Martinez Pandiani and V. Presutti. “Situated Ground Truths: Enhancing Bias-Aware AI by Situating Data Labels with SituAnnotate”. In: [Under Review] Special Issue on Trustworthy Artificial Intelligence of ACM Transactions on Knowledge Discovery from Data (TKDD) (2024)

  54. [54]

    Hypericons for Interpretability: Decoding Abstract Concepts in Visual Data

    D. S. Martinez Pandiani et al. “Hypericons for Interpretability: Decoding Abstract Concepts in Visual Data”. In: International Journal of Digital Humanities (IJDH) (2023)

  55. [55]

    Relative representations enable zero-shot latent space communication

    Luca Moschella et al. “Relative representations enable zero-shot latent space communication”. In: The Eleventh International Conference on Learning Representations. 2022

  56. [56]

    ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training

    Antonio Norelli et al. “ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training”. In: CoRR abs/2210.01738 (2022). DOI: 10.48550/ARXIV.2210.01738. arXiv: 2210.01738

  57. [57]

    CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets

    Zachary Novack et al. “CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets”. In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. Ed. by Andreas Krause et al. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 26342–26362

  58. [58]

    Grounded Situation Recognition

    Sarah Pratt et al. “Grounded Situation Recognition”. In: Computer Vision – ECCV 2020. Ed. by Andrea Vedaldi et al. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2020, pp. 314–332. ISBN: 978-3-030-58548-8. DOI: 10.1007/978-3-030-58548-8_19

  60. [60]

    Recognition using visual phrases

    Mohammad Amin Sadeghi and Ali Farhadi. “Recognition using visual phrases”. In: CVPR 2011. IEEE, 2011, pp. 1745–1752

  61. [61]

    Social Profiling through Image Understanding: Personality Inference Using Convolutional Neural Networks

    Cristina Segalin, Dong Seon Cheng, and Marco Cristani. “Social Profiling through Image Understanding: Personality Inference Using Convolutional Neural Networks”. In: Computer Vision and Image Understanding. Image and Video Understanding in Big Data 156 (Mar. 2017), pp. 34–50. ISSN: 1077-3142. DOI: 10.1016/j.cviu.2016.10.013

  62. [62]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015

  63. [63]

    Conceptnet 5.5: An open multilingual graph of general knowledge

    Robyn Speer, Joshua Chin, and Catherine Havasi. “Conceptnet 5.5: An open multilingual graph of general knowledge”. In: Thirty-first AAAI Conference on Artificial Intelligence. 2017.

  64. [64]

    Mixture-Kernel Graph Attention Network for Situation Recognition

    Mohammed Suhail and Leonid Sigal. “Mixture-Kernel Graph Attention Network for Situation Recognition”. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 10362–10371. ISBN: 978-1-72814-803-8. DOI: 10.1109/ICCV.2019.01046

  65. [65]

    A Domain Based Approach to Social Relation Recognition

    Qianru Sun, Bernt Schiele, and Mario Fritz. “A Domain Based Approach to Social Relation Recognition”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE, July 2017, pp. 435–444. ISBN: 978-1-5386-0457-1. DOI: 10.1109/CVPR.2017.54. (Visited on 01/19/2022)

  66. [66]

    Computer vision: algorithms and applications

    Richard Szeliski. Computer vision: algorithms and applications. Springer Nature, 2022

  67. [67]

    Knowledge graphs as tools for explainable machine learning: A survey

    Ilaria Tiddi and Stefan Schlobach. “Knowledge graphs as tools for explainable machine learning: A survey”. In: Artificial Intelligence 302 (2022), p. 103627

  68. [68]

    Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions

    Antoine Toisoul et al. “Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions”. In: Nature Machine Intelligence 3.1 (Jan. 2021), pp. 42–50. ISSN: 2522-5839. DOI: 10.1038/s42256-020-00280-0

  69. [69]

    The representation of abstract words: What matters? Reply to Paivio’s (2013) comment on Kousta et al. (2011)

    Gabriella Vigliocco et al. “The representation of abstract words: What matters? Reply to Paivio’s (2013) comment on Kousta et al. (2011).” In: (2013)

  70. [70]

    Knowledge graph embedding: A survey of approaches and applications

    Quan Wang et al. “Knowledge graph embedding: A survey of approaches and applications”. In: IEEE Transactions on Knowledge and Data Engineering 29.12 (2017), pp. 2724–2743

  71. [71]

    Understanding and Mapping Natural Beauty

    Scott Workman, Richard Souvenir, and Nathan Jacobs. “Understanding and Mapping Natural Beauty”. In: 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, Oct. 2017, pp. 5590–5599. ISBN: 978-1-5386-1032-9. DOI: 10.1109/ICCV.2017.596

  72. [72]

    Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval

    Xingxu Yao et al. “Attention-Aware Polarity Sensitive Embedding for Affective Image Retrieval”. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, Oct. 2019, pp. 1140–1150. ISBN: 978-1-72814-803-8. DOI: 10.1109/ICCV.2019.00123

  73. [73]

    Situation Recognition: Visual Semantic Role Labeling for Image Understanding

    Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. “Situation Recognition: Visual Semantic Role Labeling for Image Understanding”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, June 2016, pp. 5534–5542. ISBN: 978-1-4673-8851-1. DOI: 10.1109/CVPR.2016.597

  74. [74]

    ADVISE: Symbolism and External Knowledge for Decoding Advertisements

    K. Ye and A. Kovashka. “ADVISE: Symbolism and External Knowledge for Decoding Advertisements”. In: Computer Vision – ECCV 2018. Ed. by Vittorio Ferrari et al. Vol. 11219 LNCS. Cham: Springer International Publishing, 2018, pp. 868–886. ISBN: 9783030012663. DOI: 10.1007/978-3-030-01267-0_51

  75. [75]

    Interpreting the Rhetoric of Visual Advertisements

    Keren Ye et al. “Interpreting the Rhetoric of Visual Advertisements”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 43.4 (Apr. 2019), pp. 1308–1323. ISSN: 1939-3539. DOI: 10.1109/TPAMI.2019.2947440

  76. [76]

    Scaling Vision Transformers

    Xiaohua Zhai et al. “Scaling Vision Transformers”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 2022, pp. 1204–1213. DOI: 10.1109/CVPR52688.2022.01179

  77. [77]

    Reasoning about object affordances in a knowledge base representation

    Yuke Zhu, Alireza Fathi, and Li Fei-Fei. “Reasoning about object affordances in a knowledge base representation”. In: European conference on computer vision. Springer. 2014, pp. 408–424