Stitching Gaps: Fusing Situated Perceptual Knowledge with Vision Transformers for High-Level Image Classification
Pith reviewed 2026-05-24 03:29 UTC · model grok-4.3
The pith
Fusing knowledge graph embeddings of situated perceptual knowledge with Vision Transformer features improves abstract concept image classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hybrid KGE-ViT methods combine embeddings from the ARTstract Knowledge Graph (built from over 14,000 labeled cultural images and enriched with linguistic frames) with Vision Transformer embeddings, and they outperform existing techniques on abstract concept image classification. Post-hoc analyses indicate that the Vision Transformer captures pixel-level attributes while the fused method better represents abstract and semantic scene elements, which points to a synergy between the situated perceptual knowledge in the KGE and the sensory-perceptual understanding in the deep model.
What carries the argument
The ARTstract Knowledge Graph (AKG) that encodes automatically extracted perceptual semantic units and high-level linguistic frames, whose embeddings are fused with Vision Transformer embeddings via relative representations and hybrid approaches.
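To make the fusion idea concrete, here is a minimal sketch of relative representations followed by a simple concatenation. It is not the authors' implementation: the function names, the toy vectors, and the choice of concatenation as the hybrid step are all assumptions made for illustration. The point it shows is that ViT and KGE embeddings of different dimensionality become comparable once each is re-expressed as cosine similarities to the same set of anchor images.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relative_representation(x, anchors):
    """Re-express x as its cosine similarity to each anchor embedding."""
    return [cosine(x, a) for a in anchors]


def fuse(vit_vec, kge_vec, vit_anchors, kge_anchors):
    """Hypothetical hybrid step: concatenate the two relative representations."""
    return (relative_representation(vit_vec, vit_anchors)
            + relative_representation(kge_vec, kge_anchors))


# Toy data (assumed): the same two anchor images embedded in a 3-d "ViT" space
# and a 2-d "KGE" space.
vit_anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
kge_anchors = [[1.0, 0.0], [0.0, 1.0]]

fused = fuse([2.0, 0.0, 0.0], [0.0, 3.0], vit_anchors, kge_anchors)
print(fused)  # one similarity per anchor and per modality
```

Each modality contributes a vector whose length equals the number of anchors, so the fused vector has a fixed size regardless of the original embedding dimensions.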
If this is right
- The hybrid methods achieve higher accuracy than existing techniques specifically on abstract concept image classification tasks.
- Post-hoc interpretability analyses show that the Vision Transformer focuses on pixel-level visual attributes, while the KGE component handles more abstract and semantic scene elements.
- The demonstrated synergy between KGE embeddings and ViT features supports the use of neuro-symbolic methods for knowledge integration in visual representation.
- The approach suggests potential for improved performance on downstream intricate visual comprehension tasks that require both sensory and situated knowledge.
Where Pith is reading between the lines
- If the AKG extraction process generalizes beyond the cultural image dataset, the same fusion technique could be applied to other high-level vision domains such as scene understanding in video or medical imaging.
- The complementarity finding implies that pure scaling of vision transformers may hit limits on tasks requiring explicit semantic context, pointing toward systematic testing of KG fusion on distribution-shift benchmarks.
- One testable extension is whether the relative representation method used for fusion reduces the need for large amounts of labeled data compared to end-to-end fine-tuning alone.
Load-bearing premise
The automatically extracted perceptual semantic units and resulting ARTstract Knowledge Graph accurately encode situated, context-dependent human knowledge of abstract concepts such that embedding and fusing them with ViT features yields genuine generalization rather than dataset-specific fitting.
What would settle it
A control test in which the knowledge graph embeddings are replaced by embeddings from a randomly constructed graph with the same structure. If the hybrid method still outperforms the pure ViT baseline by a comparable margin on the same test set, the gain cannot be attributed to situated perceptual knowledge and that contribution would be falsified; if the advantage disappears, the claimed contribution is supported.
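One way to build such a control graph is to permute entity labels while leaving the topology untouched. The sketch below assumes the graph is available as (head, relation, tail) triples; the triples, relation names, and function names are invented for illustration and do not come from the AKG. After the shuffle, embeddings would be recomputed on the control graph and the hybrid retrained, so that its margin over the ViT baseline can be compared with the margin obtained from the real graph.

```python
import random
from collections import Counter


def shuffle_entities(triples, seed=0):
    """Permute entity labels; every edge survives, but node identities are scrambled."""
    entities = sorted({e for h, _, t in triples for e in (h, t)})
    shuffled = entities[:]
    random.Random(seed).shuffle(shuffled)
    mapping = dict(zip(entities, shuffled))
    relabeled = [(mapping[h], r, mapping[t]) for h, r, t in triples]
    return relabeled, mapping


def degree_sequence(triples):
    """Sorted list of node degrees, used to check that the structure is unchanged."""
    counts = Counter(e for h, _, t in triples for e in (h, t))
    return sorted(counts.values())


# Toy triples (assumed, not taken from the AKG).
triples = [
    ("img1", "depicts", "dog"),
    ("img1", "evokes", "comfort"),
    ("img2", "depicts", "skull"),
    ("img2", "evokes", "death"),
]

control, mapping = shuffle_entities(triples)
print(degree_sequence(control) == degree_sequence(triples))
```

This version lets image nodes and concept nodes swap places. A stricter control might permute labels only within each node type, which is a design choice the source does not specify.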
Original abstract
The increasing demand for automatic high-level image understanding, particularly in detecting abstract concepts (AC) within images, underscores the necessity for innovative and more interpretable approaches. These approaches need to harmonize traditional deep vision methods with the nuanced, context-dependent knowledge humans employ to interpret images at intricate semantic levels. In this work, we leverage situated perceptual knowledge of cultural images to enhance performance and interpretability in AC image classification. We automatically extract perceptual semantic units from images, which we then model and integrate into the ARTstract Knowledge Graph (AKG). This resource captures situated perceptual semantics gleaned from over 14,000 cultural images labeled with ACs. Additionally, we enhance the AKG with high-level linguistic frames. We compute KG embeddings and experiment with relative representations and hybrid approaches that fuse these embeddings with visual transformer embeddings. Finally, for interpretability, we conduct posthoc qualitative analyses by examining model similarities with training instances. Our results show that our hybrid KGE-ViT methods outperform existing techniques in AC image classification. The posthoc interpretability analyses reveal the visual transformer's proficiency in capturing pixel-level visual attributes, contrasting with our method's efficacy in representing more abstract and semantic scene elements. We demonstrate the synergy and complementarity between KGE embeddings' situated perceptual knowledge and deep visual model's sensory-perceptual understanding for AC image classification. This work suggests a strong potential of neuro-symbolic methods for knowledge integration and robust image representation for use in downstream intricate visual comprehension tasks. All the materials and code are available online.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a neuro-symbolic approach for classifying abstract concepts (ACs) in cultural images. It automatically extracts perceptual semantic units from over 14,000 labeled images to construct the ARTstract Knowledge Graph (AKG), augments it with linguistic frames, computes KG embeddings, and fuses them with Vision Transformer (ViT) features via relative representations and hybrid models. The central claims are that the hybrid KGE-ViT methods outperform prior techniques on AC classification and that post-hoc analyses demonstrate complementarity, with KGE capturing abstract semantic elements and ViT handling pixel-level attributes.
Significance. If the empirical claims hold after addressing validation gaps, the work would contribute to neuro-symbolic computer vision by showing how situated perceptual knowledge from a domain-specific KG can complement sensory features from transformers for high-level semantic tasks. The public release of materials and code is a positive factor that supports reproducibility.
major comments (3)
- [Methods (AKG construction and embedding)] Methods section describing AKG construction: The automatic extraction of perceptual semantic units and their modeling into the ARTstract Knowledge Graph is presented without any human validation, inter-annotator agreement scores, ablation on extraction parameters, or comparison to an independent gold standard. This is load-bearing for the claim that the AKG encodes 'situated perceptual knowledge' rather than extraction artifacts or dataset biases, directly affecting the interpretation of any performance gains from fusion.
- [Results and Experiments] Results section: The assertion that hybrid KGE-ViT methods outperform existing techniques is not accompanied by the specific quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies needed to evaluate the central empirical claim. Without these, it is impossible to determine whether observed improvements arise from genuine complementarity or from auxiliary embedding effects.
- [Interpretability analyses] Interpretability analyses: The post-hoc qualitative comparison of model similarities with training instances is used to contrast ViT's pixel-level focus with the method's semantic focus, but no quantitative measures (e.g., similarity score distributions or controlled examples) are supplied to substantiate the claimed complementarity.
minor comments (1)
- [Abstract] The abstract and introduction use 'ACs' without an initial expansion on first use.
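The second major comment asks for statistical significance tests on the hybrid-versus-baseline comparison. Since both models are scored on the same test images, a paired test fits; the sketch below uses an exact McNemar test, which is one reasonable choice and not one named in the paper or the report. The function name and the toy predictions are assumptions.

```python
from math import comb


def mcnemar_exact(y_true, pred_hybrid, pred_vit):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only the hybrid gets right
    and c counts items only the ViT baseline gets right.
    """
    rows = list(zip(y_true, pred_hybrid, pred_vit))
    b = sum(1 for y, h, v in rows if h == y and v != y)
    c = sum(1 for y, h, v in rows if h != y and v == y)
    n = b + c
    if n == 0:
        # The models never disagree on correctness, so there is no evidence either way.
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy example (assumed data): the hybrid is right on all 10 items,
# the baseline misses the first 6.
y_true = [1] * 10
b, c, p = mcnemar_exact(y_true, [1] * 10, [0] * 6 + [1] * 4)
print(b, c, p)
```

Only the discordant pairs drive the result, so two models with similar overall accuracy can still differ significantly if their errors rarely overlap.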
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve clarity, rigor, and completeness.
- Referee: [Methods (AKG construction and embedding)] Methods section describing AKG construction: The automatic extraction of perceptual semantic units and their modeling into the ARTstract Knowledge Graph is presented without any human validation, inter-annotator agreement scores, ablation on extraction parameters, or comparison to an independent gold standard. This is load-bearing for the claim that the AKG encodes 'situated perceptual knowledge' rather than extraction artifacts or dataset biases, directly affecting the interpretation of any performance gains from fusion.
  Authors: We agree that additional validation details would strengthen the presentation. In the revision we will expand the methods section with an explicit description of extraction parameters, an ablation study varying those parameters, and a discussion of potential dataset biases. We will also include a small-scale comparison against a manually reviewed subset of extracted units to provide an independent check, while noting that a full inter-annotator agreement study was outside the original scope.
  Revision: yes
- Referee: [Results and Experiments] Results section: The assertion that hybrid KGE-ViT methods outperform existing techniques is not accompanied by the specific quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies needed to evaluate the central empirical claim. Without these, it is impossible to determine whether observed improvements arise from genuine complementarity or from auxiliary embedding effects.
  Authors: The results section reports performance numbers and baseline comparisons, yet we acknowledge that statistical tests and additional ablations would make the evidence more robust. We will revise the section to tabulate all quantitative metrics explicitly, add paired statistical significance tests, and include further ablation experiments isolating the contribution of the KGE component versus embedding dimensionality effects.
  Revision: yes
- Referee: [Interpretability analyses] Interpretability analyses: The post-hoc qualitative comparison of model similarities with training instances is used to contrast ViT's pixel-level focus with the method's semantic focus, but no quantitative measures (e.g., similarity score distributions or controlled examples) are supplied to substantiate the claimed complementarity.
  Authors: We will augment the interpretability analyses with quantitative support, specifically by reporting distributions of similarity scores for each model type and by adding controlled example pairs with numerical similarity values to demonstrate the differing focus of KGE versus ViT representations.
  Revision: yes
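The quantitative support promised in the last response could take roughly the following shape. For each test image, the sketch retrieves the most similar training instances under each embedding model, records the similarity scores, and measures how much the two neighbor sets overlap. The Jaccard overlap, the function names, and the toy vectors are assumptions; the rebuttal does not say which statistics will be reported.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_neighbors(query, train, k):
    """Return the k (score, train_index) pairs with the highest cosine similarity."""
    scored = sorted(((cosine(query, t), i) for i, t in enumerate(train)), reverse=True)
    return scored[:k]


def neighbor_overlap(q_a, train_a, q_b, train_b, k):
    """Jaccard overlap of the top-k training neighbors under two embedding models."""
    ids_a = {i for _, i in top_k_neighbors(q_a, train_a, k)}
    ids_b = {i for _, i in top_k_neighbors(q_b, train_b, k)}
    return len(ids_a & ids_b) / len(ids_a | ids_b)


# Toy embeddings (assumed): the same three training images under two models.
train_vit = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_kge = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
query_vit = [1.0, 0.1]
query_kge = [1.0, 0.1]

vit_scores = [s for s, _ in top_k_neighbors(query_vit, train_vit, k=2)]
overlap = neighbor_overlap(query_vit, train_vit, query_kge, train_kge, k=2)
print(mean(vit_scores), overlap)
```

Collected over the whole test set, the per-model score lists give the similarity distributions, and a low average overlap would be numerical evidence that the two representations retrieve different training instances.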
Circularity Check
No significant circularity; empirical fusion evaluated on held-out data
The paper describes an empirical pipeline: automatic extraction of perceptual units from labeled images to construct the ARTstract KG, computation of KG embeddings, fusion with ViT features, and accuracy comparison against baselines. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that would make the hybrid performance result equivalent to its inputs by construction. The central claim rests on standard train/test splits and external resource construction rather than internal re-derivation, so the derivation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: automatically extracted perceptual semantic units from images can be modeled to capture situated perceptual semantics of abstract concepts.
invented entities (1)
- ARTstract Knowledge Graph (AKG): no independent evidence