Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz · arXiv 2303.07274

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

Kosmos-2: Grounding Multimodal Large Language Models to the World

cs.CL · 2023-06-26 · unverdicted · novelty 6.0

Kosmos-2 grounds text to image regions by encoding referring expressions as Markdown links to sequences of location tokens, and it is trained on GrIT, a new dataset of grounded image-text pairs.

citing papers explorer

Showing 2 of 2 citing papers.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · none · ref 6

Kosmos-2: Grounding Multimodal Large Language Models to the World

cs.CL · 2023-06-26 · unverdicted · none · ref 2
