arXiv preprint arXiv:1901.06706 , year=

Visual entailment: A novel task for fine-grained image understanding , author= · 2019 · cs.CV · arXiv 1901.06706

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and outperforms several other state-of-the-art VQA based models. Finally, we demonstrate the explainability of EVE through cross-modal attention visualizations. The SNLI-VE dataset is publicly available at https://github.com/ necla-ml/SNLI-VE.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

CoCa: Contrastive Captioners are Image-Text Foundation Models

cs.CV · 2022-05-04 · accept · novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · novelty 3.0

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

citing papers explorer

Showing 3 of 3 citing papers.

CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 76 · internal anchor
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting cs.CV · 2026-04-20 · unverdicted · none · ref 50
ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.
Multilingual Vision-Language Models, A Survey cs.CL · 2025-09-26 · accept · none · ref 153 · internal anchor
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

arXiv preprint arXiv:1901.06706 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer