arxiv: 1906.09954 · v1 · pith:ZAXJWRZ5new · submitted 2019-06-24 · 💻 cs.CV · cs.AI

Integrating Knowledge and Reasoning in Image Understanding

Pith reviewed 2026-05-25 17:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image understanding · knowledge integration · reasoning mechanisms · deep learning · neural networks · visual question answering · external knowledge · semantic segmentation

The pith

Integrating external knowledge with neural networks and higher-level reasoning addresses limitations in data-driven image understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys representative reasoning mechanisms, knowledge integration methods, and their applications in tasks such as object recognition, semantic segmentation, and visual question answering. It identifies the absence of external knowledge and higher-level reasoning as a key hindrance in current deep learning approaches for image understanding. By reviewing efforts from various research groups, the work highlights concrete ways neural networks can be combined with knowledge sources. A reader would care because this points to practical routes for making systems more capable when training data alone proves insufficient. The survey ends by outlining potential pathways forward based on the reviewed methods.

Core claim

Deep learning based data-driven approaches have succeeded in image understanding applications but still lack knowledge integration as well as higher-level reasoning capabilities. This work presents a brief survey of representative reasoning mechanisms, knowledge integration methods, and corresponding applications. It further discusses key efforts on integrating external knowledge with neural networks and concludes by discussing potential pathways to improve reasoning capabilities.

What carries the argument

Survey of reasoning mechanisms and methods for integrating external knowledge with neural networks in image understanding tasks.

If this is right

  • Visual question answering and similar tasks can draw on external knowledge bases to handle cases beyond what training data covers.
  • Combining neural networks with structured knowledge sources yields concrete performance gains in image understanding.
  • Multiple distinct approaches to reasoning integration already exist and can be built upon.
  • Future image understanding systems will require explicit pathways for incorporating higher-level reasoning.
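
As a rough illustration of the first point, the sketch below joins the output of an object detector with facts from an external knowledge base, so that a question about an object's purpose can be answered even though the purpose is not visible in the image. The toy triple store, the relation names, and the function `answer_with_knowledge` are all hypothetical and are not taken from the surveyed paper; a real system would likely query a resource such as ConceptNet or WordNet and would parse the question with a learned model.

```python
from typing import List, Optional

# Hypothetical toy knowledge base of (subject, relation, object) triples.
# A real system would query an external resource instead.
KNOWLEDGE_BASE = {
    ("umbrella", "used_for", "staying dry"),
    ("banana", "is_a", "fruit"),
    ("fire hydrant", "used_for", "fighting fires"),
}


def answer_with_knowledge(detected_objects: List[str], relation: str) -> Optional[str]:
    """Return the first fact that matches a detected object and the asked relation."""
    for obj in detected_objects:
        # Sort the triples so the lookup order is deterministic.
        for subj, rel, value in sorted(KNOWLEDGE_BASE):
            if subj == obj and rel == relation:
                return value
    return None  # No supporting fact found.


# Assumed detector output for an image, and a question parsed to "used_for".
detections = ["person", "fire hydrant"]
print(answer_with_knowledge(detections, "used_for"))
```

The lookup is deliberately kept separate from perception: the detector only names what is in the image, and the knowledge base supplies what cannot be learned from pixels alone.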

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar integration strategies could apply to other perception tasks where pure pattern matching fails on novel inputs.
  • Structured knowledge graphs might serve as a modular add-on rather than requiring full retraining of networks.
  • Evaluating integrated systems on out-of-distribution images would provide a direct test of the reasoning benefit.
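
The "modular add-on" idea can be sketched as a post-processing step that leaves the network untouched: scores from a frozen classifier are blended with support from a small knowledge graph, given other labels already found in the scene. The graph contents, the function names `relatedness` and `rescore`, and the blending weight are illustrative assumptions, not a method described in the paper.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical relatedness edges between labels (symmetric, weights in [0, 1]).
KNOWLEDGE_GRAPH: Dict[Tuple[str, str], float] = {
    ("keyboard", "monitor"): 0.9,
    ("keyboard", "piano"): 0.1,
}


def relatedness(a: str, b: str) -> float:
    """Look up an edge weight in either direction; unknown pairs count as 0."""
    return KNOWLEDGE_GRAPH.get((a, b), KNOWLEDGE_GRAPH.get((b, a), 0.0))


def rescore(scores: Dict[str, float], context_labels: Iterable[str],
            weight: float = 0.5) -> Dict[str, float]:
    """Blend frozen classifier scores with knowledge-graph support, then renormalize."""
    context = list(context_labels)
    adjusted = {}
    for label, score in scores.items():
        # Strongest link between this candidate label and any context label.
        support = max((relatedness(label, c) for c in context), default=0.0)
        adjusted[label] = (1 - weight) * score + weight * support
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()} if total > 0 else adjusted


# Assumed classifier output for an ambiguous region, with "monitor" seen nearby.
result = rescore({"piano": 0.55, "keyboard": 0.45}, ["monitor"])
print(max(result, key=result.get))
```

Because the graph is consulted only at inference time, it can be swapped or extended without retraining the classifier, which is the property the bullet above speculates about.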

Load-bearing premise

The selected representative papers and methods provide a balanced and sufficiently complete view of the field.

What would settle it

An experiment showing that purely data-driven methods without external knowledge or explicit reasoning achieve equal or better results than the surveyed integrated approaches on standard image understanding benchmarks would undermine the claimed hindrance.

Figures

Figures reproduced from arXiv: 1906.09954 by Chitta Baral, Somak Aditya, Yezhou Yang.

Figure 1: The diagram shows the information hierarchy for …
Figure 2: (a) Example of questions that require explicit external knowledge [35], (b) Example where knowledge helps [37].

Abstract

Deep learning based data-driven approaches have been successfully applied in various image understanding applications ranging from object recognition, semantic segmentation to visual question answering. However, the lack of knowledge integration as well as higher-level reasoning capabilities with the methods still pose a hindrance. In this work, we present a brief survey of a few representative reasoning mechanisms, knowledge integration methods and their corresponding image understanding applications developed by various groups of researchers, approaching the problem from a variety of angles. Furthermore, we discuss upon key efforts on integrating external knowledge with neural networks. Taking cues from these efforts, we conclude by discussing potential pathways to improve reasoning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This manuscript is a brief survey claiming that purely data-driven deep learning methods for image understanding tasks (object recognition, semantic segmentation, visual question answering) are limited by lack of knowledge integration and higher-level reasoning; it reviews a few representative reasoning mechanisms and knowledge-integration approaches from the literature, discusses key efforts to combine external knowledge with neural networks, and outlines potential pathways for improvement.

Significance. If the selected examples accurately reflect the state of the field, the survey could usefully synthesize existing work and highlight directions for moving beyond purely data-driven image understanding; the manuscript does not advance new derivations, proofs, or empirical results.

major comments (1)
  1. [Abstract] Abstract: the central synthesis claim rests on the representativeness of the 'few' selected methods, yet the text provides no explicit selection criteria, coverage of omitted lines of work, or discussion of potential selection bias; this directly affects the load-bearing assumption that the reviewed efforts constitute a balanced view.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'discuss upon key efforts' is nonstandard and should be revised for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive suggestion. The concern about explicit selection criteria in the abstract is well-taken for a survey paper, and we will revise accordingly to strengthen the presentation of scope and balance.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central synthesis claim rests on the representativeness of the 'few' selected methods, yet the text provides no explicit selection criteria, coverage of omitted lines of work, or discussion of potential selection bias; this directly affects the load-bearing assumption that the reviewed efforts constitute a balanced view.

    Authors: We agree this is a valid point for improving clarity in a brief survey. In the revised manuscript we will (1) expand the abstract to state the selection criteria (recent works integrating external knowledge or symbolic reasoning with neural networks for image understanding tasks, chosen to illustrate diverse mechanisms across object recognition, segmentation, and VQA), (2) add a short paragraph in the introduction explicitly noting the scope, key omitted lines of work (e.g., purely symbolic systems, large-scale pre-training without explicit knowledge bases, and reinforcement-learning-only reasoning), and (3) include a brief limitations statement on potential selection bias. These changes will be confined to the front matter and will not alter the core reviewed content.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

This is a brief survey paper with no derivations, equations, fitted parameters, predictions, or technical claims that could reduce to self-definition or self-citation. The central claim is a high-level synthesis of existing literature on knowledge integration and reasoning in image understanding, supported by references to external work rather than any new proof or measurement whose validity depends on the paper's own inputs. No load-bearing steps exist that match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper; it introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5623 in / 872 out tokens · 16470 ms · 2026-05-25T17:36:54.774660+00:00

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 2 internal anchors

  1. [1]

    Somak Aditya, Rudra Saha, Yezhou Yang, and Chitta Baral. Spatial knowledge distillation to aid visual reasoning. IEEE Winter Conference on Applications of Computer Vision (WACV), pages 227–235, 2019

  2. [2]

    Somak Aditya, Yezhou Yang, and Chitta Baral. Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering. In AAAI, pages 629–637, 2018

  3. [3]

    Somak Aditya, Yezhou Yang, Chitta Baral, and Yiannis Aloimonos. Combining knowledge and reasoning through probabilistic soft logic for image puzzle solving. In UAI 2018, pages 238–248. Association For Uncertainty in Artificial Intelligence (AUAI), 2018

  4. [4]

    Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, and Cornelia Fermüller. Image understanding using vision and reasoning through scene description graph. Computer Vision and Image Understanding, pages 33–45, 2017

  5. [5]

    Franz Baader, Diego Calvanese, Deborah McGuinness, Peter Patel-Schneider, and Daniele Nardi. The description logic handbook: Theory, implementation and applications. Cambridge university press, 2003

  6. [6]

    Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. Hinge-loss markov random fields and probabilistic soft logic. Journal of Machine Learning Research, 18:1–67, 2017

  7. [7]

    Remi Cadene, Hedi Ben-Younes, Nicolas Thome, and Matthieu Cord. Murel: Multimodal Relational Reasoning for Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  8. [8]

    Stamatia Dasiopoulou, Ioannis Kompatsiaris, and Michael G Strintzis. Applying fuzzy dls in the extraction of image semantics. In Journal on Data Semantics XIV, pages 105–132. Springer, 2009

  9. [9]

    Ernest Davis and Gary Marcus. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58(9):92–103, August 2015

  10. [10]

    Maaike de Boer, Laura Daniele, Paul Brandt, and Maya Sappelli. Applying semantic reasoning in image retrieval. Proc. ALLDATA, 2015

  11. [11]

    Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: A probabilistic prolog and its application in link discovery. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pages 2468–2473, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc

  13. [13]

    Abhinav Gupta, Aniruddha Kembhavi, and Larry S Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1775–1789, 2009

  14. [14]

    Catherine Havasi, Robert Speer, and Jason Alonso. Conceptnet 3: a flexible, multilingual semantic network for common sense knowledge. In Recent advances in natural language processing, pages 27–29. Citeseer, 2007

  15. [15]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015

  16. [16]

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  17. [17]

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910, 2017

  18. [18]

    Justin Johnson, Ranjay Krishna, Michael Stark, Jia Li, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3668–3678, June 2015

  19. [19]

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision, 123(1):32–73, May 2017

  20. [20]

    Dieu-Thu Le, Jasper Uijlings, and Raffaella Bernardi. Exploiting language models for visual recognition. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 769–779, 2013

  21. [21]

    Joohyung Lee, Samidh Talsania, and Yi Wang. Computing lp mln using asp and mln solvers. Theory and Practice of Logic Programming, 17(5-6):942–960, 2017

  22. [22]

    Douglas B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Commun. ACM, 38(11):33–38, November 1995

  23. [23]

    Ben London, Sameh Khamis, Stephen Bach, Bert Huang, Lise Getoor, and Larry Davis. Collective activity detection using hinge-loss markov random fields. In Proceedings of the IEEE CVPR Workshops, pages 566–571, 2013

  24. [24]

    Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems, pages 3753–3763, 2018

  25. [25]

    Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. The more you know: Using knowledge graphs for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2673–2681, 2017

  26. [26]

    George A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, November 1995

  27. [27]

    David A. Randell, Zhan Cui, and Anthony G. Cohn. A spatial logic based on regions and connection. In Proceedings 3rd International Conference on Knowledge Representation and Reasoning, 1992

  28. [28]

    Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006

  29. [29]

    Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, pages 3788–3800, 2017

  30. [30]

    Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. Kvqa: Knowledge-aware visual question answering. In AAAI, 2019

  31. [31]

    Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, pages 4967–4976, 2017

  32. [32]

    Luciano Serafini and Artur d’Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422, 2016

  33. [33]

    Jakob Suchan and Mehul Bhatt. The geometry of a scene: On deep semantics for visual perception driven cognitive film, studies. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–

  34. [34]

    Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 697–706, New York, NY, USA, 2007. ACM

  35. [35]

    Douglas Summers-Stay, Ching L Teo, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. Using a minimal action grammar for activity understanding in the real world. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4104–

  36. [36]

    Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Fvqa: fact-based visual question answering. IEEE TPAMI, 2017

  37. [37]

    Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, 163:21–40, 2017

  38. [38]

    Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4622–4630, 2016

  39. [39]

    Markus Wulfmeier, Dushyant Rao, and Ingmar Posner. Incorporating human domain knowledge into large scale cost function learning. arXiv preprint arXiv:1612.04318, 2016

  40. [40]

    Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation. ICCV, 2017

  41. [41]

    Bo Zheng, Yibiao Zhao, Joey Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Scene understanding by reasoning stability and safety. International Journal of Computer Vision, 112(2):221–238, 2015

  42. [42]

    Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In ECCV, September 2018

  43. [43]

    Yuke Zhu, Alireza Fathi, and Li Fei-Fei. Reasoning about object affordances in a knowledge base representation. In ECCV (2), pages 408–424. Springer, 2014