The multimodal projection retains positional information mainly at the four corners, effectively marking the image boundaries needed to infer the image's dimensions.

[Figure: Positional Information. Panels: Position Projection (LLaVA-7B/13B); Layer 13 (LLaVA-7B); Layer 12 (LLaVA-13B). Axes: Position.]

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Mechanisms of Object Localization in Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0 · ref 42

Localization in VLMs relies on a containerization mechanism driven by object-aligned tokens and a narrow set of specialized attention heads in early-to-mid or mid-to-late layers.
