pith. machine review for the scientific record.

arXiv: 1811.10582 · v2 · pith:GGZ7KRIYnew · submitted 2018-11-26 · cs.CV

Visual Entailment Task for Visually-Grounded Language Learning

classification: cs.CV
keywords: entailment, language, visual, snli-ve, tasks, dataset, inference, introduce

We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than by a natural language sentence. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks; it is built on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA-based models perform.
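
To make the task setup concrete, the following is a minimal Python sketch of how a single VE example might be represented and scored. It assumes the three-way labels of the Stanford Natural Language Inference corpus (entailment, neutral, contradiction). The names `VEExample`, `image_id`, `hypothesis`, and `accuracy` are hypothetical and chosen only for illustration; they are not the schema of the SNLI-VE repository or part of EVE.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Label(Enum):
    """Three-way labels as used in SNLI (assumed to carry over to VE)."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class VEExample:
    """One visual entailment example (hypothetical layout)."""
    image_id: str    # identifier of the premise image (illustrative value only)
    hypothesis: str  # natural language hypothesis to judge against the image
    label: Label     # gold relation between image premise and hypothesis


def accuracy(examples: Sequence[VEExample], predictions: Sequence[Label]) -> float:
    """Fraction of examples whose predicted label equals the gold label."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    if not examples:
        return 0.0
    correct = sum(ex.label == pred for ex, pred in zip(examples, predictions))
    return correct / len(examples)


# Illustrative data only; not taken from the SNLI-VE dataset.
examples = [
    VEExample("img_0001", "Two people are outdoors.", Label.ENTAILMENT),
    VEExample("img_0001", "The people are asleep indoors.", Label.CONTRADICTION),
]
predictions = [Label.ENTAILMENT, Label.NEUTRAL]
score = accuracy(examples, predictions)
```

The only structural difference from a textual entailment example is that the premise field refers to an image instead of holding a sentence; the hypothesis and label are unchanged.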



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Routing-Based Continual Learning for Multimodal Large Language Models

    cs.LG 2025-11 unverdicted novelty 6.0

    Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.