pith. machine review for the scientific record.

arxiv: 1811.00491 · v3 · submitted 2018-11-01 · 💻 cs.CL · cs.CV

Recognition: unknown

A Corpus for Reasoning About Natural Language Grounded in Photographs

classification 💻 cs.CL cs.CV
keywords reasoning, data, language, natural, photographs, images, joint, task

We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.
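
The task described in the abstract, deciding whether a sentence is true of a pair of photographs, can be pictured as binary classification over (sentence, image, image) triples. The following is a minimal sketch under stated assumptions, not the dataset's actual schema or the authors' evaluation code: the class name `CaptionPairExample`, its field names, the `accuracy` helper, and the toy examples are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CaptionPairExample:
    """One example: a sentence, a pair of photographs, and a truth label.

    All field names are hypothetical; the real dataset may use a different layout.
    """
    sentence: str
    left_image: str   # assumed: a path or URL identifying the first photograph
    right_image: str  # assumed: a path or URL identifying the second photograph
    label: bool       # True if the sentence is true of the image pair


def accuracy(
    examples: Iterable[CaptionPairExample],
    predict: Callable[[CaptionPairExample], bool],
) -> float:
    """Fraction of examples whose predicted truth value matches the label."""
    examples = list(examples)
    if not examples:
        raise ValueError("accuracy is undefined for an empty example set")
    correct = sum(predict(ex) == ex.label for ex in examples)
    return correct / len(examples)


def always_true(example: CaptionPairExample) -> bool:
    """Trivial baseline that ignores its input and always answers True."""
    return True


# Toy examples invented for this sketch; they are not drawn from the dataset.
toy_examples = [
    CaptionPairExample("Each image contains two dogs.", "a.jpg", "b.jpg", True),
    CaptionPairExample("The left image shows more bottles than the right.", "c.jpg", "d.jpg", False),
    CaptionPairExample("One image shows a bird on a branch.", "e.jpg", "f.jpg", True),
]

baseline_accuracy = accuracy(toy_examples, always_true)
```

A constant baseline such as `always_true` is a useful sanity check when evaluating any model on a true/false task, since a model must beat it to show it is using the sentence and images at all.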

This paper has not been read by Pith yet.


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  2. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  3. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

  4. ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

    cs.CV 2026-04 unverdicted novelty 5.0

    ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.

  5. WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 5.0

    WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.