From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge
Abstract
In this paper we propose the construction of linguistic descriptions of images. This is achieved by extracting scene description graphs (SDGs) from visual scenes using an automatically constructed knowledge base. SDGs are built using both vision and reasoning. Specifically, commonsense reasoning is applied to (a) detections obtained from existing perception methods on given images, (b) a "commonsense" knowledge base constructed via natural language processing of image annotations, and (c) lexical ontological knowledge from resources such as WordNet. Amazon Mechanical Turk (AMT)-based evaluations on the Flickr8k, Flickr30k, and MS-COCO datasets show that in most cases, sentences auto-constructed from SDGs obtained by our method give a more relevant and thorough description of an image than a recent state-of-the-art caption-based approach. Our Image-Sentence Alignment Evaluation results are also comparable to those of recent state-of-the-art approaches.
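The abstract's pipeline (detections plus commonsense relations, linearized into a sentence) can be illustrated with a toy sketch. Everything below is an assumption for illustration: the real system uses perception models, a mined commonsense knowledge base, and WordNet, whereas here the detections and relation rules are hard-coded and the `SDG`, `build_sdg`, and `linearize` names are hypothetical.

```python
# Hypothetical sketch: turn a list of object detections into a tiny
# scene description graph (SDG) and then a sentence. The detections
# and the "commonsense" rules below are hard-coded stand-ins for the
# paper's perception models and mined knowledge base.

from dataclasses import dataclass, field

@dataclass
class SDG:
    entities: list = field(default_factory=list)   # detected objects
    relations: list = field(default_factory=list)  # (subject, predicate, object)

def build_sdg(detections, commonsense_rules):
    """Keep the detections; add only relations whose endpoints were detected."""
    sdg = SDG(entities=list(detections))
    for subj, pred, obj in commonsense_rules:
        if subj in detections and obj in detections:
            sdg.relations.append((subj, pred, obj))
    return sdg

def linearize(sdg):
    """Naive surface realization: one clause per relation."""
    if not sdg.relations:
        return "There is " + ", ".join(sdg.entities) + "."
    return " ".join(f"A {s} is {p} a {o}." for s, p, o in sdg.relations)

detections = ["dog", "frisbee", "grass"]
rules = [("dog", "catching", "frisbee"), ("dog", "on", "grass")]
print(linearize(build_sdg(detections, rules)))
# A dog is catching a frisbee. A dog is on a grass.
```

The filtering step in `build_sdg` mirrors the paper's idea that commonsense knowledge only licenses relations consistent with what was actually perceived; the realization step is deliberately crude compared to the sentence construction the paper evaluates.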
Forward citations
Cited by 4 Pith papers
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning. CLIPScore uses a web-pretrained CLIP model to evaluate image captions without references and achieves higher human correlation than CIDEr or SPICE.
- MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation. MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
- MLLM-as-a-Judge Exhibits Model Preference Bias. MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and the Pomms ensemble reduces it.
- ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs. ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.