What’s in the Image? What Survives in Modern Multimodal LMs

Yanzhe Zhang · 2024 · arXiv 2411.17491

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

citing papers explorer

Showing 1 of 1 citing paper.

Counting to Four is still a Chore for VLMs cs.CV · 2026-04-11 · unverdicted · none · ref 18
