arxiv: 1811.10582 · v2 · pith:GGZ7KRIYnew · submitted 2018-11-26 · 💻 cs.CV

Visual Entailment Task for Visually-Grounded Language Learning

Ning Xie, Farley Lai, Derek Doran, Asim Kadav

classification 💻 cs.CV

keywords entailment, language, visual, snli-ve, tasks, dataset, inference, introduce

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than a natural language sentence. We propose a novel dataset for VE tasks, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), built from the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA-based models perform.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Routing-Based Continual Learning for Multimodal Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.